- Timestamp: Apr 24, 2009, 6:47:29 PM (17 years ago)
- Author: waue
- Comment: --
| v16 | v17 | |
| 101 | 101 | = Step 3: Edit the configuration files = |
| 102 | 102 | * All configuration files are located under /opt/nutch/conf |
| 103 | | == 3.1 hadoop-env.sh == |
| | 103 | == 3.1 $NUTCH_HOME/conf/hadoop-env.sh == |
| 104 | 104 | * Add the settings below anywhere in the existing hadoop-env.sh file |
| | 105 | {{{ |
| | 106 | $ cd /opt/nutch/conf |
| | 107 | $ gedit hadoop-env.sh |
| | 108 | }}} |
| | 109 | |
| 105 | 110 | {{{ |
| 106 | 111 | #!sh |
| … |
… |
|
| 116 | 121 | * Load the environment settings |
| 117 | 122 | {{{ |
| 118 | | $ source /opt/nutch/conf/hadoop-env.sh |
| | 123 | $ source ./hadoop-env.sh |
| 119 | 124 | }}} |
| 120 | 125 | * P.S.: Writing these settings into /etc/bash.bashrc as well is strongly recommended as the most foolproof option. |
| 121 | 126 | |
| 122 | 127 | |
| 123 | | == 3.2 conf/nutch-site.xml == |
| | 128 | == 3.2 $NUTCH_HOME/conf/nutch-site.xml == |
| 124 | 129 | * An important configuration file. The required settings have been added to it; for more information on the other parameters, see nutch-default.xml |
| 125 | 130 | {{{ |
| 126 | | $ vim conf/nutch-site.xml |
| | 131 | $ gedit nutch-site.xml |
| 127 | 132 | }}} |
| 128 | 133 | {{{ |
| … |
… |
|
| 198 | 203 | }}} |
| 199 | 204 | |
| 200 | | == 3.3 crawl-urlfilter.txt == |
| | 205 | == 3.3 $NUTCH_HOME/conf/crawl-urlfilter.txt == |
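| 201 | 206 | * Re-edit the crawl filter rules. This file matters because a poorly configured filter leaves the crawl results almost empty, which means your search engine will end up finding no data. |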
| 202 | 207 | {{{ |
| 203 | | $ vim conf/crawl-urlfilter.txt |
| | 208 | $ gedit ./crawl-urlfilter.txt |
| 204 | 209 | }}} |
| 205 | 210 | {{{ |
| … |
… |
|
| 221 | 226 | == 4.1 Edit the URL list == |
| 222 | 227 | {{{ |
| | 228 | $ cd /opt/nutch |
| 223 | 229 | $ mkdir urls |
| 224 | 230 | $ echo "http://www.nchc.org.tw" >> ./urls/urls.txt |