Changes between Version 5 and Version 6 of waue/2010/1029
- Timestamp: Oct 29, 2010, 5:03:57 PM (15 years ago)
waue/2010/1029
v5 v6
9 9 [[PageOutline]]
10 10
11 (v5) == Preface ==
11 (v6) = Preface =
12 12 With the nutch 1.0 used by crawlzilla 0.2.2, crawling some sites ends after the "crawldb + generate + fetch" loop and the remaining steps never run: hadoop has no job, while go.sh sits idle, showing the crawling status forever, and never reaches finish.
13 13
… …
27 27 }}}
28 28
29 (v5) == Manual repair steps ==
29 (v6) = Manual repair steps =
30 30
31 31
… …
34 34 }}}
35 35
36 (v5) === index ===
36 (v6) == index ==
37 37 * linkdb tw_yahoo_com_6/linkdb
38 38
… …
46 46 }}}
47 47
48 (v5) === index-lucene ===
48 (v6) == index-lucene ==
49 49
50 50 * index-lucene tw_yahoo_com_6/indexes
… …
60 60
61 61
62 (v5) === dedup ===
62 (v6) == dedup ==
63 63
64 64 * dedup 1: urls by time 100.00%
… …
74 74 /opt/crawlzilla/nutch/bin/nutch dedup /user/crawler/cw_yahoo_5/index
75 75 }}}