Nutch 1.3

[intro]

  • 7 June 2011 - Apache Nutch 1.3 Released

[setup]

get

  • download the Apache Nutch 1.3 source release and extract it to /opt/nutch-1.3
  • build it with ant:
cd /opt/nutch-1.3
ant
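
A sketch of the download and extract step above, spelled out; the archive.apache.org mirror URL and the apache-nutch-1.3 archive/directory names are assumptions, not from these notes, so adjust them to your mirror and the actual file name:

wget http://archive.apache.org/dist/nutch/apache-nutch-1.3-src.tar.gz   # mirror URL and file name are assumptions
tar xzf apache-nutch-1.3-src.tar.gz
sudo mv apache-nutch-1.3 /opt/nutch-1.3   # match the path used in the rest of these notes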

deploy

bin/nutch and nutch-1.3.job can be placed into the Hadoop installation to integrate Nutch with Hadoop; see the sketch below.
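
A minimal sketch of that deploy step, assuming the ant build put its output under runtime/deploy and Hadoop lives at /opt/hadoop (both paths and the exact .job file name are assumptions):

cp /opt/nutch-1.3/runtime/deploy/*.job /opt/hadoop/          # the job jar built by ant
cp /opt/nutch-1.3/runtime/deploy/bin/nutch /opt/hadoop/bin/
# with Hadoop's bin/ on the PATH, bin/nutch then submits the crawl as MapReduce jobs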

local

cd /opt/nutch-1.3/runtime/local
  • bin/nutch (add the following line)
export JAVA_HOME="/usr/lib/jvm/java-6-sun"
  • conf/nutch-site.xml (add the following properties)
<configuration>
<property>
  <name>http.agent.name</name>
  <value>waue_test</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>nutch</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>waue_test</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>waue_test</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>waue_test</value>
</property>
</configuration>
  • conf/regex-urlfilter.txt (replace its contents with the following; in Nutch 1.2 this file was conf/crawl-urlfilter.txt)
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[*!]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
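
Optionally, to keep the crawl inside a single site instead of accepting everything, the final catch-all line can be replaced with a domain pattern such as the following (the domain here is only an example):

# accept only pages under lucene.apache.org (example domain)
+^http://([a-z0-9]*\.)*lucene.apache.org/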

[setup solr]

  • download Apache Solr 3.3.0 and extract it to /opt/solr-3.3.0/
cd /opt/solr-3.3.0/
cp /opt/nutch-1.3/conf/schema.xml /opt/solr-3.3.0/example/solr/conf/
cd /opt/solr-3.3.0/example/
java -jar start.jar
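
Once start.jar is running, a quick way to confirm Solr is up is to hit the example configuration's ping handler (curl is just one option):

curl http://localhost:8983/solr/admin/ping   # should return an XML response with status "OK"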

[execute]

mkdir urls ; echo "http://lucene.apache.org/nutch/" >urls/url.txt
bin/nutch crawl urls -dir crawl -depth 2 -topN 50
  • the crawl directory will contain only 3 subdirectories (no local Lucene index is built any more; indexing is handled by Solr):
    crawldb  linkdb  segments

  • finally, push the crawl results to Solr
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
  • use the Solr web admin interface to check the indexed documents

http://localhost:8983/solr/admin/
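
The same check from the command line, querying whatever the crawl indexed (the query term and field list are just examples):

curl "http://localhost:8983/solr/select?q=nutch&fl=url,title"   # returns matching documents as XML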

run-once

The crawl command can also crawl and index to Solr in a single run:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
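
For reference, the one-shot command above can also be broken into individual steps, roughly as the Nutch 1.3 tutorial [1] describes them (the segment-selection shell snippet is my own shorthand, and an explicit parse step is assumed because fetcher.parse defaults to false):

bin/nutch inject crawl/crawldb urls                       # seed the crawldb from urls/
bin/nutch generate crawl/crawldb crawl/segments -topN 5   # select URLs to fetch
s1=`ls -d crawl/segments/2* | tail -1`                    # newest segment directory
bin/nutch fetch $s1                                       # fetch the pages
bin/nutch parse $s1                                       # parse the fetched content
bin/nutch updatedb crawl/crawldb $s1                      # fold results back into the crawldb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments    # build the link database
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*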

FAQ

  • Q1 : Where is the "war" file, and how do I build it?
  • A1 :
    The simple answer is that there is no war file any more.
    
    Both the web app and the Lucene index which previously shipped with Nutch
    have been deprecated.
    
    Please have a look at the new tutorial [1] and the site for more
    information on the new functionality and features which ship with Nutch 1.3.
    
    [1] http://wiki.apache.org/nutch/RunningNutchAndSolr
    
    
Last modified on Jul 28, 2011, 12:15:31 PM