Changes between Initial Version and Version 1 of shunfa/2012/0911


Timestamp: Sep 11, 2012, 4:23:56 PM
Author: shunfa
[[PageOutline]]
= Nutch1.5 + Solr3.6.1 =

== Downloads ==
 * [http://apache.stu.edu.tw/nutch/1.5/apache-nutch-1.5-bin.tar.gz Nutch1.5]
 * [http://ftp.twaren.net/Unix/Web/apache/lucene/solr/3.6.1/apache-solr-3.6.1.tgz Solr3.6.1]

== Steps ==
=== 0. Environment setup ===
==== Install Java and check the environment variables ====
{{{
$ vim ~/.bashrc
}}}
Add the following line (or the path of another Java version):
{{{
export JAVA_HOME=/usr/lib/jvm/java-6-sun/
}}}

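As a quick sanity check (a minimal sketch, not part of the original tutorial; the JDK path above is just an example), you can verify that `JAVA_HOME` points at a directory that actually contains a `bin/java` executable:

```python
import os

def check_java_home(java_home):
    """Return True if java_home looks like a usable JDK/JRE root,
    i.e. the directory exists and contains bin/java."""
    if not java_home:
        return False
    return os.path.isfile(os.path.join(java_home, "bin", "java"))

# Check the value exported in ~/.bashrc (prints True or False).
print(check_java_home(os.environ.get("JAVA_HOME")))
```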
=== 1. Nutch configuration ===
==== Extract the Nutch package ====
{{{
$ tar zxvf apache-nutch-1.5-bin.tar.gz
}}}
 * The extracted directory is referred to as _[$NUTCH_HOME]_ from here on.

==== Check that Nutch runs ====
 * Run the following command:
{{{
$ [$NUTCH_HOME]/bin/nutch
}}}

 * Result:
{{{
Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
}}}

 * If you see the listing above, the runtime environment is OK!

==== Set the crawler agent name ====
{{{
$ vim [$NUTCH_HOME]/conf/nutch-site.xml
}}}
 * Add the following property:
{{{
#!text
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
}}}

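The property above goes inside the `<configuration>` element of nutch-site.xml, which uses the Hadoop-style name/value property format. As an illustration only (a sketch, not Nutch's own code), such a file can be read back with Python's standard library:

```python
import xml.etree.ElementTree as ET

# A minimal nutch-site.xml as it would look after the edit above.
NUTCH_SITE = """<?xml version="1.0"?>
<configuration>
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
</configuration>"""

def get_property(xml_text, name):
    """Return the <value> of the named <property>, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None

print(get_property(NUTCH_SITE, "http.agent.name"))  # → My Nutch Spider
```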
==== Set the URLs to crawl ====
 * Create the seed list (this example crawls http://www.nchc.org.tw/tw/):
{{{
$ mkdir -p [$NUTCH_HOME]/urls
$ echo "http://www.nchc.org.tw/tw/" >> [$NUTCH_HOME]/urls/seed.txt
}}}
==== Configure the URL filter ====
{{{
$ vim [$NUTCH_HOME]/conf/regex-urlfilter.txt
}}}
 * Replace the original rules with the following:
{{{
#!text
# accept anything else
+.
}}}

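regex-urlfilter.txt is read top-down: each non-comment line is a regular expression prefixed with `+` (accept) or `-` (reject), and the first rule that matches a URL decides its fate; URLs matching no rule are rejected. A rough Python sketch of that behaviour (illustrative only, not Nutch's actual filter implementation):

```python
import re

def url_allowed(url, rules):
    """Apply +/- prefixed regex rules in order; the first match decides.
    URLs matching no rule are rejected, mirroring regex-urlfilter.txt."""
    for line in rules:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# With the single rule "+." every URL is accepted.
print(url_allowed("http://www.nchc.org.tw/tw/", ["# accept anything else", "+."]))  # → True
# The stock file rejects e.g. image suffixes before a final "+." rule.
print(url_allowed("http://example.com/logo.gif", [r"-\.(gif|jpg|png)$", "+."]))  # → False
```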
==== Run the crawl from the command line ====
 * Depth of 3, fetching at most five documents per level:
{{{
$ [$NUTCH_HOME]/bin/nutch crawl urls -dir crawl -depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-09-11 16:25:29
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
...(truncated)
}}}

 * The following message indicates that the crawl has completed:
{{{

}}}
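For intuition about what `-depth` and `-topN` control, here is a toy simulation (purely illustrative; the link graph and URL names are made up, and real Nutch also scores and prioritizes URLs): each round generates at most topN not-yet-fetched URLs, fetches them, and folds newly discovered links back into the crawl db, for depth rounds.

```python
def crawl(seeds, links, depth, top_n):
    """Toy model of `nutch crawl`: run `depth` generate/fetch/updatedb
    rounds, each fetching at most `top_n` not-yet-fetched URLs.
    `links` maps a URL to the URLs discovered when it is fetched."""
    crawldb = list(seeds)  # known URLs, in discovery order
    fetched = []
    for _ in range(depth):
        batch = [u for u in crawldb if u not in fetched][:top_n]
        if not batch:
            break  # nothing left to fetch
        for url in batch:
            fetched.append(url)
            for out in links.get(url, []):
                if out not in crawldb:
                    crawldb.append(out)  # the "updatedb" step
    return fetched

toy_links = {"seed": ["a", "b"], "a": ["c"], "b": ["d"]}
print(crawl(["seed"], toy_links, depth=3, top_n=5))  # → ['seed', 'a', 'b', 'c', 'd']
```

With depth=1 only the seed itself would be fetched; raising topN widens each round rather than deepening the crawl.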