= Nutch Research and Notes =

== Introduction ==

 - NutchEz, currently under development, is already working, but only the basic features are in place and a number of problems have been found.
 - The hope is that, after reading the official Nutch web pages in full, better ideas and ways to improve it will emerge.

== More Commands ==

=== readdb ===

 - read / dump crawl db
 - Usage: !CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
 - -stats [-sort]    print overall statistics to System.out
{{{
$ nutch readdb /tmp/search/crawldb -stats
09/06/09 12:18:13 INFO  mapred.MapTask: data buffer = 79691776/99614720
09/06/09 12:18:13 INFO  mapred.MapTask: record buffer = 262144/327680
09/06/09 12:18:14 INFO  crawl.CrawlDbReader: TOTAL urls:    1072
09/06/09 12:18:14 INFO  crawl.CrawlDbReader: status 1 (db_unfetched):    1002
09/06/09 12:18:14 INFO  crawl.CrawlDbReader: status 2 (db_fetched):    68
}}}
 - -dump <out_dir> [-format normal|csv ]    dump the whole db to a text file in <out_dir>
{{{
$ nutch readdb /tmp/search/crawldb/ -dump ./dump
$ vim ./dump/part-00000
}}}
 - -url <url>    print information on <url> to System.out
{{{
$ nutch readdb /tmp/search/crawldb/ -url http://www.nchc.org.tw/tw/
URL: http://www.nchc.org.tw/tw/
Version: 7
Status: 6 (db_notmodified)
Fetch time: Thu Jul 09 14:34:48 CST 2009
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 3.1152809
Signature: ce0202bbd593b09b86ce8a9aa991b321
Metadata: _pst_: success(1), lastModified=0

$ nutch readdb /tmp/search/crawldb/ -url http://www.nchc.org.tw
URL: http://www.nchc.org.tw not found
}}}
 - -topN <nnnn> <out_dir> [<min>]    dump top <nnnn> urls sorted by score to <out_dir> (an example invocation is sketched in the Notes section below)

=== inject ===

 - inject new urls into the database
 - Usage: Injector <crawldb> <url_dir> (an example invocation is sketched in the Notes section below)

=== readlinkdb ===

 - read / dump link db
 - Usage: !LinkDbReader <linkdb> (-dump <out_dir> | -url <url>) (the -url form is sketched in the Notes section below)
{{{
$ nutch readlinkdb /tmp/search/linkdb/ -dump ./dump
$ vim ./dump/part-00000
}}}

=== readseg ===

 - read / dump segment data
 - Usage: !SegmentReader (-dump ... | -list ... | -get ...) [general options]
 - !SegmentReader -dump <segment_dir> <output> [general options]
{{{
$ nutch readseg -dump /tmp/search/segments/20090609143444/ ./dump/
$ vim ./dump/dump
}}}
 - !SegmentReader -list (<segment_dir1> ... | -dir <segments>) [general options]
{{{
$ nutch readseg -list /tmp/search/segments/20090609143444/
NAME            GENERATED   FETCHER START         FETCHER END           FETCHED  PARSED
20090609143444  1           2009-06-09T14:34:48   2009-06-09T14:34:48   1        1
}}}
 - !SegmentReader -get <segment_dir> <keyValue> [general options]
{{{
$ nutch readseg -get /tmp/search/segments/20090609143444/ http://bioinfo.nchc.org.tw/
}}}

=== updatedb ===

 - update crawl db from segments after fetching
 - Usage: !CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
{{{
$ nutch updatedb /tmp/search/crawldb/ -dir /tmp/search/segments/
}}}

=== dedup ===

 - remove duplicates from a set of segment indexes
 - Usage: !DeleteDuplicates <index1> <index2> ...
{{{
$ nutch dedup /tmp/search/indexes/
}}}

== Notes ==
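The -topN option of readdb is listed above without an example. A minimal sketch of one possible invocation, assuming the same /tmp/search/crawldb as above and an illustrative output directory ./topurls:
{{{
$ nutch readdb /tmp/search/crawldb/ -topN 10 ./topurls
$ vim ./topurls/part-00000
}}}
The dump should contain the top-scoring URLs together with their scores, one per line.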
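The inject command above shows only its usage string. A minimal sketch, assuming an illustrative local directory ./urls containing plain-text files with one seed URL per line:
{{{
$ mkdir urls
$ echo "http://www.nchc.org.tw/tw/" > urls/seed.txt
$ nutch inject /tmp/search/crawldb/ ./urls
}}}
Newly injected URLs are recorded as db_unfetched, the same status counted in the readdb -stats output above.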
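The readlinkdb usage above also accepts a -url form that is not demonstrated in this section. A minimal sketch, assuming the same /tmp/search/linkdb and an illustrative URL expected to exist in the link db:
{{{
$ nutch readlinkdb /tmp/search/linkdb/ -url http://www.nchc.org.tw/tw/
}}}
This should print the inlinks recorded for that URL to standard output.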