
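Changes in Nutch 1.2

Nutch 1.2 differs from Nutch 1.0 in many ways: Lucene has been updated, and the way the search web application is linked to the index store has changed. After some trial and error, the following procedure appears to work: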

Prerequisites

Assume the index store was built by running bin/nutch crawl against http://www.nchc.org.tw/tw/ and then downloaded to the local machine at ~/kkk. (So kkk/ contains index, indexes, segments, crawldb, and linkdb.)
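The layout assumed above can be checked before going any further. The following is a minimal sketch, not part of Nutch: check_crawl_dir is a hypothetical helper that only tests whether the five sub-directories named above exist under a given path.

```shell
# check_crawl_dir is a hypothetical helper (not shipped with Nutch).
# It reports every expected sub-directory that is missing under $1
# and returns non-zero if at least one is missing.
check_crawl_dir() {
    dir="$1"
    missing=0
    for sub in index indexes segments crawldb linkdb; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing: $dir/$sub"
            missing=1
        fi
    done
    return "$missing"
}
```

For the path used in these notes, it could be called as check_crawl_dir "$HOME/kkk" before copying anything into Tomcat.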

Tomcat is installed in /opt/tomcat/

Nutch is installed in /opt/nutch/

Assume we are creating a search page named 0311test.

Steps

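# Stop Tomcat before changing the webapp directory
/opt/tomcat/bin/catalina.sh stop
# Unpack the Nutch 1.2 web application into a new 0311test webapp
mkdir /opt/tomcat/webapps/0311test/
cp /opt/nutch/nutch-1.2.war /opt/tomcat/webapps/0311test
cd /opt/tomcat/webapps/0311test/
jar xvf ./nutch-1.2.war
rm nutch-1.2.war
# Copy the index store in as ./crawl, then start Tomcat from this directory
cp -rf ~/kkk ./crawl
/opt/tomcat/bin/catalina.sh start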

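According to the official tutorial (http://wiki.apache.org/nutch/NutchTutorial), the trick is that the directory you are in when you run /opt/tomcat/bin/catalina.sh start must contain a folder named crawl. Only then does the Nutch search page map to the index store correctly, because by default the search webapp looks for crawl relative to Tomcat's working directory.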
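Because the working directory matters, it can help to wrap the start command so it refuses to run when no crawl folder is present. This is only a sketch under the assumptions of these notes: start_from_webapp is a hypothetical helper, and the default catalina.sh path is the one assumed above.

```shell
# start_from_webapp is a hypothetical helper.
# $1: directory that must contain the crawl folder
# $2: path to catalina.sh (defaults to the install path assumed in these notes)
start_from_webapp() {
    webapp_dir="$1"
    catalina="${2:-/opt/tomcat/bin/catalina.sh}"
    if [ ! -d "$webapp_dir/crawl" ]; then
        echo "no crawl folder in $webapp_dir" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is left unchanged.
    ( cd "$webapp_dir" && "$catalina" start )
}
```

With the layout from the steps above, the call would be start_from_webapp /opt/tomcat/webapps/0311test.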

Then visit: http://localhost:8080/0311test

Verifying with NutchBean

The official site describes a way to check the index store with NutchBean. The original text (http://wiki.apache.org/nutch/NutchTutorial) says only this:

Simplest way to verify the integrity of your crawl is to launch NutchBean  from command line:

 bin/nutch org.apache.nutch.searcher.NutchBean apache 

where apache is the search term (note that NutchBean will only search pages in the crawl directory, so if you named the crawl directory something else, NutchBean will not find any results). After you have verified that the above command returns results you can proceed to setting up the web interface. 

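The trick, however, is to run it as:

bin/nutch org.apache.nutch.searcher.NutchBean [search term] [index directory on HDFS]

So when this program runs, the four Hadoop roles (daemons) must already be started, and the index store to be searched must already be on HDFS. Otherwise the search finds nothing.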
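One way to check the daemon requirement is to inspect the output of the JDK's jps tool. The sketch below is an illustration only: missing_daemons is a hypothetical helper, and the four daemon names it looks for (NameNode, DataNode, JobTracker, TaskTracker) are an assumption about which four roles are meant, since the notes do not list them.

```shell
# missing_daemons is a hypothetical helper. It reads `jps` output on stdin
# and prints each expected daemon name that does not appear in it.
# ASSUMPTION: the "four roles" are the four names in the loop below.
missing_daemons() {
    jps_output=$(cat)
    for name in NameNode DataNode JobTracker TaskTracker; do
        # -w matches whole words, so SecondaryNameNode does not count as NameNode
        printf '%s\n' "$jps_output" | grep -qw "$name" || echo "$name"
    done
}

# Possible usage (paths as assumed in these notes):
#   jps | missing_daemons
#   /opt/nutch/bin/hadoop dfs -put ~/kkk crawlbek
#   /opt/nutch/bin/hadoop dfs -ls
```

If missing_daemons prints nothing, all four assumed daemons were found. The dfs -put line shows how a local crawl directory could be uploaded under the name used in the example session.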

waue@u1004:/opt/tomcat/webapps/0311test$ /opt/nutch/bin/hadoop dfs -ls

Found 14 items
drwxr-xr-x   - waue supergroup          0 2010-11-24 19:26 /user/waue/crawlbek
drwxr-xr-x   - waue supergroup          0 2010-11-25 09:47 /user/waue/ftp1
drwxr-xr-x   - waue supergroup          0 2010-11-26 15:55 /user/waue/t-hfil2
drwxr-xr-x   - waue supergroup          0 2010-11-25 18:01 /user/waue/t-hfilter
drwxr-xr-x   - waue supergroup          0 2010-11-26 16:13 /user/waue/url

waue@u1004:/opt/tomcat/webapps/0311test$ /opt/nutch/bin/nutch org.apache.nutch.searcher.NutchBean nchc crawlbek

Total hits: 249
 0 20101124184700/http://www.nchc.org.tw/en/
 ... Reserved|Resolution 1024 * 768| webmaster@nchc.narl.org.tw Latest Update ... th ~ December 10 th , 2010@ NCHC, Taiwan More   Southeast Asia International ... 
 1 20101124184929/http://www.nchc.org.tw/en/e_paper/
 ... to Cloud Computing Issue 19:NCHC Establishes a Cloud ... HPC Research - The NCHC’s All New GPU Cluster ... 
 2 20101124184929/http://www.nchc.org.tw/en/about/publication/message/2010_spring.php
 ... Collaborative Research Applied Sciences   ::: About NCHC Home  »  About NCHC  »  Publications  »  NCHC Newsletter  » NCHC Newsletter Spring, 2010, Issue NO ... 
 3 20101124184929/http://www.nchc.org.tw/en/about/
 ... Collaborative Research Applied Sciences   ::: About NCHC Home  » About NCHC With Taiwan's most bountiful ... Reserved|Resolution 1024 * 768| webmaster@
 4 20101124184929/http://www.nchc.org.tw/en/about/job.php
 ... Collaborative Research Applied Sciences   ::: About NCHC Home  »  About NCHC  » Jobs at NCHC If you would like to ... 
 5 20101124185839/http://www.nchc.org.tw/en/about/publication/message/
 ... Collaborative Research Applied Sciences   ::: About NCHC Home  »  About NCHC  »  Publications  » NCHC Newsletter     NCHC Newsletter   2009 Spring   2009Summer     2009 ... 
 6 20101124184929/http://bioinfo.nchc.org.tw/
Bioinformatics Knowledge Database 國網中心生物知識庫與生物計算服務 ... 
 7 20101124184929/http://ecogrid.nchc.org.tw/
 ... were picked as show cases.     NCHC Ecogrid team provided the ... into database in NCHC, consumers can query by a ... 
 8 20101124184929/http://www.nchc.org.tw/en/research/list.php
 ... Director History Publications Jobs at NCHC Driving Directions HPC Services Educational ... Wed, November 24, 2010 ::: About NCHC Areas of Service ... 
 9 20101124184929/http://accta.nchc.org.tw/en/
ACCTA | Login | Home | 中文 |   To protect your account and password security, please click  ... 
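To turn the check above into something scriptable, the hit count can be pulled out of the first line of the output. This is a minimal sketch: total_hits is a hypothetical helper, and it assumes the "Total hits: N" line is written to standard output in the form shown in the session above.

```shell
# total_hits is a hypothetical helper: it reads NutchBean output on stdin
# and prints the number from the "Total hits: N" line, if there is one.
total_hits() {
    sed -n 's/^Total hits: \([0-9][0-9]*\).*$/\1/p'
}

# Possible usage (command taken from the session above):
#   hits=$(/opt/nutch/bin/nutch org.apache.nutch.searcher.NutchBean nchc crawlbek | total_hits)
#   [ "${hits:-0}" -gt 0 ] && echo "index is searchable"
```

An empty result or a count of 0 would suggest that the index directory argument does not point at a usable index store on HDFS.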

Last modified on Mar 11, 2011, 7:19:18 PM