[[PageOutline]]
{{{
#!html

The Complete Nutch Guide

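}}}

= Preface =
 * I have tested this before, and many people online have shared working setups, but this page focuses on:
   * a complete Nutch installation, including a fix for garbled Chinese text
   * setting up Nutch from a Hadoop point of view
   * a search engine that finds not only page content but also files linked from pages (such as PDF and MS Word)

= Environment =
 * Directories
 || /opt/nutch || Nutch home directory ||
 || /opt/nutch_conf || Nutch configuration files ||
 || /opt/hadoop || Hadoop home directory ||
 || /opt/conf || Hadoop configuration files ||
 || /tmp/ || Logs, intermediate files and temporary files ||

== step 1 Install a Hadoop cluster ==
 * See [wiki:0330Hadoop_Lab3hadoop叢集安裝] for the cluster installation.
 * A single-node setup works too, and in that case installing Nutch directly is less work. For a single-node Nutch installation see [wiki:waue/2009/0406 Nutch installation], but use the configuration files from this page, which are more complete.

== step 2 Download and install ==
 * Install Java 1.6
{{{
$ sudo apt-get install sun-java6-bin
}}}
 * Download Nutch 1.0 (2009/03/28)
{{{
$ wget http://ftp.twaren.net/Unix/Web/apache/lucene/nutch/nutch-1.0.tar.gz
}}}
 * Unpack the archive so that the Nutch home directory is /opt/nutch, as listed in the Environment table.

== step 3 Edit the configuration files ==
 * All configuration files are under $NUTCH_HOME/conf

=== 3.1 hadoop-env.sh ===
Insert the following lines anywhere in the existing hadoop-env.sh:
{{{
#!sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/nutch
export HADOOP_LOG_DIR=/tmp/nutch/logs
export HADOOP_SLAVES=/opt/nutch/conf/slaves
}}}

=== 3.2 hadoop-site.xml ===
{{{
#!sh
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>gm1.nchc.org.tw:9000</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>gm1.nchc.org.tw:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
</configuration>
}}}

=== 3.3 nutch-site.xml ===
{{{
#!sh
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>waue</value>
    <description>HTTP 'User-Agent' request header.</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>MyTest</value>
    <description>Further description</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>gm1.nchc.org.tw</value>
    <description>A URL to advertise in the User-Agent header.</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>waue@nchc.org.tw</value>
    <description>An email address</description>
  </property>
</configuration>
}}}

=== 3.4 slaves ===
No change is actually needed here, because the file already contains localhost.
{{{
#!sh
localhost
}}}

=== 3.5 crawl-urlfilter.txt ===
Change the following two rules in this file so that they read:
{{{
#!sh
# skip URLs containing certain characters as probable queries, etc.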
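-[*!@]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*.*/
}}}

== step 4 Run ==

=== 4.1 Edit the URL list ===
{{{
$ mkdir urls
$ vim urls/urls.txt
}}}
{{{
#!sh
http://lucene.apache.org
}}}

=== 4.2 Start HDFS ===
{{{
$ bin/hadoop namenode -format
$ bin/start-all.sh
}}}

=== 4.3 Upload the list to HDFS ===
{{{
$ bin/hadoop dfs -put urls urls
}}}

=== 4.4 Run nutch crawl ===
{{{
$ bin/nutch crawl urls -dir crawl01 -depth 3
}}}

== step 5 Browse via the web ==

=== 5.1 Install Tomcat ===
 * Download
{{{
$ cd /opt/
$ wget http://ftp.twaren.net/Unix/Web/apache/tomcat/tomcat-6/v6.0.18/bin/apache-tomcat-6.0.18.tar.gz
}}}
 * Extract
{{{
$ tar -xzvf apache-tomcat-6.0.18.tar.gz
$ mv apache-tomcat-6.0.18 tomcat
}}}

=== 5.2 Import the crawl results into Tomcat ===
The war file ships in the Nutch home directory, so it is copied into web/ before being unpacked there.
{{{
$ cd /opt/nutch
$ mkdir web
$ cp nutch-1.0.war web/
$ cd web
$ jar -xvf nutch-1.0.war
$ rm nutch-1.0.war
$ mv /opt/tomcat/webapps/ROOT /opt/tomcat/webapps/ROOT-ori
$ cd /opt/nutch
$ mv /opt/nutch/web /opt/tomcat/webapps/ROOT
$ vim /opt/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
}}}
{{{
#!sh
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/opt/search</value>
  </property>
</configuration>
}}}
searcher.dir must point to the directory that holds the crawl output, so this assumes the results of step 4.4 are available at /opt/search.

Also edit /opt/tomcat/conf/server.xml to fix the garbled Chinese text. This is typically done by adding the attributes URIEncoding="UTF-8" and useBodyEncodingForURI="true" to the HTTP Connector element.

=== 5.3 Browse the crawl results ===
{{{
$ /opt/tomcat/bin/startup.sh
}}}
[http://gm1.nchc.org.tw:8080]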
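
As a side note on the two rules from section 3.5, the sketch below approximates them with grep -E so that a few URLs can be checked before starting a crawl. It assumes that rules are applied top to bottom with the first match deciding, and that these two patterns behave the same in grep -E as in the Java regex engine Nutch uses. The function name check_url and the sample URLs are hypothetical, and the sketch does not run Nutch's URL normalizers.

```shell
#!/bin/sh
# Approximate the two crawl-urlfilter.txt rules from section 3.5.
# Assumption: the first matching rule decides; no match means the URL is dropped.
check_url() {
    url=$1
    # Rule 1: -[*!@]  rejects URLs containing *, ! or @
    if printf '%s\n' "$url" | grep -Eq '[*!@]'; then
        echo reject
    # Rule 2: +^http://([a-z0-9]*\.)*.*/  accepts any http:// URL with a path
    elif printf '%s\n' "$url" | grep -Eq '^http://([a-z0-9]*\.)*.*/'; then
        echo accept
    else
        # No rule matched
        echo reject
    fi
}

check_url 'http://lucene.apache.org/nutch/'          # accept
check_url 'http://lucene.apache.org/search?q=nutch'  # accept
check_url 'http://user@example.com/'                 # reject
check_url 'https://lucene.apache.org/'               # reject
```

The second sample shows the effect of the edited first rule: query URLs containing ? and = pass, because only *, ! and @ are rejected. The last sample is dropped because the accept rule only covers http://.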