- Timestamp: Oct 29, 2010, 5:03:57 PM
- Author: waue
- Comment: --
v5 | v6 | |
9 | 9 | [[PageOutline]] |
10 | 10 | |
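11 | | == Preface == |
| 11 | = Preface = |
12 | 12 | With the nutch 1.0 used by crawlzilla 0.2.2, crawling some sites occasionally stalls after the "crawldb + generate + fetch" loop completes: the remaining steps are never run, hadoop has no job, and go.sh sits idle, showing "crawling" forever and never reaching finish. |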
13 | 13 | |
… |
… |
|
27 | 27 | }}} |
28 | 28 | |
29 | | == Manual repair steps == |
| 29 | = Manual repair steps = |
30 | 30 | |
31 | 31 | |
… |
… |
|
34 | 34 | }}} |
35 | 35 | |
36 | | === index === |
| 36 | == index == |
37 | 37 | * linkdb tw_yahoo_com_6/linkdb |
38 | 38 | |
… |
… |
|
46 | 46 | }}} |
47 | 47 | |
48 | | === index-lucene === |
| 48 | == index-lucene == |
49 | 49 | |
50 | 50 | * index-lucene tw_yahoo_com_6/indexes |
… |
… |
|
60 | 60 | |
61 | 61 | |
62 | | === dedup === |
| 62 | == dedup == |
63 | 63 | |
64 | 64 | * dedup 1: urls by time 100.00% |
… |
… |
|
74 | 74 | /opt/crawlzilla/nutch/bin/nutch dedup /user/crawler/cw_yahoo_5/index |
75 | 75 | }}} |
| 76 | |
| 77 | |
| 78 | == download and import == |
| 79 | |
| 80 | {{{ |
| 81 | /opt/crawlzilla/nutch/bin/hadoop dfs -get cw_yahoo_5 ~/crawlzilla/archieve/cw_yahoo_5 |
| 82 | cd ~/crawlzilla/archieve/ |
| 83 | echo "0h:0m:0s" >> ./cw_yahoo_5/cw_yahoo_5PassTime |
| 84 | echo "5" >> ./cw_yahoo_5/.crawl_depth |
| 85 | cd ~/crawlzilla/archieve/cw_yahoo_5/index |
| 86 | mv part-00000/* ./ |
| 87 | rmdir part-00000/ |
| 88 | }}} |
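
The last two commands of the download and import step (moving the contents of `part-00000` up into `index` and removing the emptied directory) can be wrapped in a small guard so that they are skipped when the directory is missing. The following is only a minimal sketch under that assumption: `flatten_index` is a hypothetical helper name and is not part of crawlzilla or nutch.

```shell
#!/bin/sh
# flatten_index is a hypothetical helper, not part of crawlzilla or nutch.
# It moves everything under <index_dir>/part-00000 up into <index_dir>
# and then removes the emptied part-00000 directory, mirroring the
# mv / rmdir commands in the download and import step.
flatten_index() {
    index_dir=$1
    part_dir="$index_dir/part-00000"

    # Nothing to flatten if part-00000 is missing (for example, on a re-run).
    if [ ! -d "$part_dir" ]; then
        echo "no part-00000 under $index_dir, nothing to do" >&2
        return 1
    fi

    # Move the index files up one level; stop if the move fails.
    mv "$part_dir"/* "$index_dir"/ || return 1

    # rmdir only succeeds when the directory is now empty.
    rmdir "$part_dir"
}
```

With the paths from the step above, it would be called as `flatten_index ~/crawlzilla/archieve/cw_yahoo_5/index`.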