Context Navigation

← Previous Change
Wiki History
Next Change →

0911

Timestamp:: Sep 11, 2012, 4:23:56 PM (13 years ago)
Author:: shunfa
Comment:: --

Legend:

: Unmodified
: Added
: Removed
: Modified

shunfa/2012/0911

                       v1
+[[PageOutline]]
+= Nutch1.5 + Solr3.6.1 =
+== 下載 ==
+ * [http://apache.stu.edu.tw/nutch/1.5/apache-nutch-1.5-bin.tar.gz Nutch1.5]
+ * [http://ftp.twaren.net/Unix/Web/apache/lucene/solr/3.6.1/apache-solr-3.6.1.tgz Solr3.6.1]
+== Steps ==
+=== 0. 前置環境設定 ===
+==== 安裝JAVA，確認環境變數 ====
+{{{
+$ vim ~/.bashrc
+}}}
+加入下列參數(or其他版本的Java路徑)
+{{{
+export JAVA_HOME=/usr/lib/jvm/java-6-sun/
+}}}
+=== 1. Nutch設定 ===
+==== 解壓縮nutch安裝包 ====
+{{{
+$ tar zxvf apache-nutch-1.5-bin.tar.gz
+}}}
+ * 解壓縮的資料路徑，以下開始以_[$NUTCH_HOME]_表示
+==== 確認是否可以執行 ====
+ * 執行以下指令
+{{{
+$ [$NUTCH_HOME]/bin/nutch
+}}}
+ * 執行結果
+{{{
+Usage: nutch COMMAND
+where COMMAND is one of:
+  crawl             one-step crawler for intranets
+  readdb            read / dump crawl db
+  mergedb           merge crawldb-s, with optional filtering
+  readlinkdb        read / dump link db
+  inject            inject new urls into the database
+  generate          generate new segments to fetch from crawl db
+  freegen           generate new segments to fetch from text files
+  fetch             fetch a segment's pages
+  parse             parse a segment's pages
+  readseg           read / dump segment data
+  mergesegs         merge several segments, with optional filtering and slicing
+  updatedb          update crawl db from segments after fetching
+  invertlinks       create a linkdb from parsed segments
+  mergelinkdb       merge linkdb-s, with optional filtering
+  solrindex         run the solr indexer on parsed segments and linkdb
+  solrdedup         remove duplicates from solr
+  solrclean         remove HTTP 301 and 404 documents from solr
+  parsechecker      check the parser for a given url
+  indexchecker      check the indexing filters for a given url
+  domainstats       calculate domain statistics from crawldb
+  webgraph          generate a web graph from existing segments
+  linkrank          run a link analysis program on the generated web graph
+  scoreupdater      updates the crawldb with linkrank scores
+  nodedumper        dumps the web graph's node scores
+  plugin            load a plugin and run one of its classes main()
+  junit             runs the given JUnit test
+ or
+  CLASSNAME         run the class named CLASSNAME
+Most commands print help when invoked w/o parameters.
+}}}
+ * 若出現以上片段，則執行環境OK!
+==== 設定爬取機器人名稱 ====
+{{{
+$ vim [$NUTCH_HOME]/conf/nutch-site.xml
+}}}
+ * 加入以下資訊：
+{{{
+#!text
+<property>
+ <name>http.agent.name</name>
+ <value>My Nutch Spider</value>
+</property>
+}}}
+==== 設定欲爬取的網址 ====
+ * 建立網址資料(以爬取http://www.nchc.)
+{{{
+$ mkdir -p [$NUTCH_HOME]/urls
+$ echo "http://www.nchc.org.tw/tw/" >> [$NUTCH_HOME]/urls/seed.txt
+}}}
+==== 設定filter ====
+{{{
+$ vim [$NUTCH_HOME]/conf/regex-urlfilter.txt
+}}}
+ * 用下列文字取代原始設定
+{{{
+#!text
+# accept anything else
++.
+}}}
+==== 透過指令執行爬取任務 ====
+ * 深度3層，每層最多抓取五個文件
+{{{
+$ [$NUTCH_HOME]/bin/nutch crawl urls -dir crawl -depth 3 -topN 5
+solrUrl is not set, indexing will be skipped...
+crawl started in: crawl
+rootUrlDir = urls
+threads = 10
+depth = 3
+solrUrl=null
+topN = 5
+Injector: starting at 2012-09-11 16:25:29
+Injector: crawlDb: crawl/crawldb
+Injector: urlDir: urls
+Injector: Converting injected urls to crawl db entries.
+...(略)
+}}}
+ * 出現以下訊息，表示已經抓取完成
+{{{
+}}}