Timestamp: Jun 9, 2009, 4:33:59 PM
Author: waue
Changes: v4 → v5

=== readdb ===
 - read / dump crawl db
 - Usage: !CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
 - -stats [-sort] print overall statistics to System.out
{{{
$ nutch readdb /tmp/search/crawldb -stats

09/06/09 12:18:13 INFO mapred.MapTask: data buffer = 79691776/99614720
09/06/09 12:18:13 INFO mapred.MapTask: record buffer = 262144/327680
09/06/09 12:18:14 INFO crawl.CrawlDbReader: TOTAL urls: 1072
09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 1002
09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 68
}}}
 - -dump <out_dir> [-format normal|csv] dump the whole db to a text file in <out_dir>
 - -url <url> print information on <url> to System.out
 - -topN <nnnn> <out_dir> [<min>] dump top <nnnn> urls sorted by score to <out_dir>
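A minimal sketch of the remaining modes, using the flags from the usage line above; the output paths and example URL are illustrative:
{{{
$ nutch readdb /tmp/search/crawldb -dump ./crawldb_dump
$ nutch readdb /tmp/search/crawldb -topN 10 ./crawldb_top10
$ nutch readdb /tmp/search/crawldb -url http://www.example.com/
}}}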

=== inject ===
 - inject new urls into the database
 - Usage: Injector <crawldb> <url_dir>
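A minimal sketch, assuming a local urls/ directory holding plain-text files of seed URLs:
{{{
$ nutch inject /tmp/search/crawldb urls
}}}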

=== readlinkdb ===
 - read / dump link db
 - Usage: !LinkDbReader <linkdb> (-dump <out_dir> | -url <url>)
{{{
$ nutch readlinkdb /tmp/search/linkdb/ -dump ./dump
$ vim ./dump/part-00000
}}}
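The -url mode prints the link information stored for a single URL; the URL here is illustrative:
{{{
$ nutch readlinkdb /tmp/search/linkdb/ -url http://www.example.com/
}}}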

=== readseg ===
 - read / dump segment data
 - Usage: !SegmentReader (-dump ... | -list ... | -get ...) [general options]
 - !SegmentReader -dump <segment_dir> <output> [general options]
 - !SegmentReader -list (<segment_dir1> ... | -dir <segments>) [general options]
 - !SegmentReader -get <segment_dir> <keyValue> [general options]
{{{
…
}}}
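A sketch of the list and dump modes, assuming segments under /tmp/search/segments/; the timestamped segment name is illustrative:
{{{
$ nutch readseg -list -dir /tmp/search/segments/
$ nutch readseg -dump /tmp/search/segments/20090609121813 ./segdump
}}}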

=== updatedb ===
 - update crawl db from segments after fetching
 - Usage: !CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
{{{
$ nutch updatedb /tmp/search/crawldb/ -dir /tmp/search/segments/
…
}}}
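The per-segment form with URL filtering and normalization, sketched with an illustrative segment name:
{{{
$ nutch updatedb /tmp/search/crawldb/ /tmp/search/segments/20090609121813 -filter -normalize
}}}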

=== dedup ===
 - remove duplicates from a set of segment indexes
 - Usage: !DeleteDuplicates <indexes> ...
{{{
$ nutch dedup /tmp/search/indexes/
}}}

== Notes ==