wiki:waue/2010/1125

Version 2 (modified by waue, 15 years ago)

--

Nutch 1.2 test
Additional tests: protocols: ftp, file; features: pdf, url-filter

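File test

When crawling, Nutch does not automatically list a directory's contents and descend into them; each file has to be listed individually in url.txt. In addition, file cannot be used together with http.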
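
Since every file has to be listed by hand, the seed list can be generated with a small script. The following is only a sketch, not part of the original test: the function name `write_seed_list` and the `file:///...` URL form produced by `Path.as_uri()` are assumptions, and the output file should live outside the directory being walked so that it does not list itself.

```python
from pathlib import Path


def write_seed_list(root, seed_file):
    """Write one file: URL per regular file under root and return the URLs.

    Hypothetical helper: Nutch itself does not provide it.
    """
    root_path = Path(root).resolve()  # as_uri() requires an absolute path
    # Walk the tree recursively and keep regular files only.
    urls = sorted(p.as_uri() for p in root_path.rglob("*") if p.is_file())
    # One URL per line, the layout a url.txt seed file uses.
    Path(seed_file).write_text("".join(u + "\n" for u in urls), encoding="utf-8")
    return urls
```

Calling `write_seed_list("/path/to/docs", "urls/url.txt")` (both paths are placeholders) would write one URL for each file found under the directory.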

FTP test

OK. Crawl depth works as well, but some PDF and Word files could not be parsed, while HTML and TXT files were all fine.

Filter: crawl-urlfilter.txt
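
As far as I understand the regex URL filter, each non-comment line of crawl-urlfilter.txt is a `+` (accept) or `-` (reject) sign followed by a regular expression; the first rule that matches a URL decides, and a URL that matches no rule is dropped. The sketch below imitates that behaviour so a rule set can be checked before a crawl. It is not Nutch code: `EXAMPLE_RULES`, `parse_rules` and `accepts` are hypothetical names, the rules are illustrative rather than the ones used in this test, and Python's `re` is only an approximation of Java regular expressions.

```python
import re

# Illustrative rules only (assumption): accept file: and ftp: URLs, reject the rest.
EXAMPLE_RULES = """\
# accept the protocols under test
+^(file|ftp):
# skip everything else
-.
"""


def parse_rules(text):
    """Turn filter lines into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the decision of the first rule whose pattern is found in url."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched, so the URL is ignored
```

With these example rules, `accepts(parse_rules(EXAMPLE_RULES), url)` lets `file:` and `ftp:` URLs through and rejects `http:` URLs, because the first matching line wins.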

Parser: Tika

Understanding information content with Apache Tika