[[PageOutline]]

= Related Links =
 * [https://issues.apache.org/jira/browse/NUTCH-427 protocol-smb]
 * [http://jcifs.samba.org/ jcifs project]

= Installation =
 1. Download the latest protocol-smb archive [https://issues.apache.org/jira/secure/attachment/12442365/protocol-smb-dist.zip] and unpack it. The unpacked directory is referred to below as $pro-smb-dir.
 2. Copy the '''protocol-smb''' folder from $pro-smb-dir/build/plugins/ into '''$nutch_home/plugins/'''. The folder contains three files: jcifs-1.3.0.jar, plugin.xml and protocol-smb.jar.
 3. Edit $nutch_home/conf/nutch-site.xml and add protocol-smb to plugin.includes:
{{{
#!xml
<property>
  <name>plugin.includes</name>
  <value>protocol-smb|other plugins...</value>
</property>
}}}
 4. Copy $pro-smb-dir/conf/smb.properties to $nutch_home/conf/ and fill in its values.
 5. Seed URLs take the form smb://server/share
 6. Run the Nutch crawl:
{{{
#!sh
#!/bin/bash
# Usage: pass the crawl depth as the first argument.
crawl_dep=$1
echo "$crawl_dep"

# Report whether the command that ran just before this call succeeded,
# and abort the script if it did not.
function debug_echo () {
  if [ $? -eq 0 ]; then
    echo "$1 finished"
  else
    echo "$1 failed"
    exit 1
  fi
}

source /opt/nutchez/nutch/conf/hadoop-env.sh
debug_echo "import hadoop-env.sh"

# Remove the old index (local and HDFS) and the old seed list (HDFS),
# then upload the current seed list.
echo "delete search (local,hdfs) and urls (hdfs)"
rm -rf /home/nutchuser/nutchez/search
/opt/nutchez/nutch/bin/hadoop dfs -rmr urls search
/opt/nutchez/nutch/bin/hadoop dfs -put /home/nutchuser/nutchez/urls urls

# Crawl the seed list into the "search" directory on HDFS.
/opt/nutchez/nutch/bin/nutch crawl urls -dir search -depth "$crawl_dep" -topN 5000 -threads 1000
debug_echo "nutch crawl"

# Copy the crawl result from HDFS to the local disk.
/opt/nutchez/nutch/bin/hadoop dfs -get search /home/nutchuser/nutchez/search
debug_echo "download search"

# Restart Tomcat so the search front end picks up the new index.
/opt/nutchez/tomcat/bin/shutdown.sh
/opt/nutchez/tomcat/bin/startup.sh
debug_echo "tomcat restart"
}}}

= Problems Encountered =
The crawl log shows the injector rejecting the seed URL:
{{{
#!txt
2010-05-27 14:07:19,417 WARN org.apache.nutch.crawl.Injector: Skipping smb://140.110.138.179/share:java.net.MalformedURLException: unknown protocol: smb
}}}
 * I tried the following suggested workarounds:
{{{
#!txt
a) A short-term solution is to install the JCIFS jar library found in the
   protocol-smb folder into JDKHOME/jre/lib/ext and (or) JREHOME/lib/ext.

b) After completing step a), if the exception is still thrown, set the system
   property by passing the following argument to the JVM:
   -Djava.protocol.handler.pkgs=jcifs

c) You can also set the property in your code. For example, if you start
   crawling with org.apache.nutch.crawl.Crawl, add the following two lines.
   This has the same effect as b).

   public static void main(String args[]) throws Exception {
       System.setProperty("java.protocol.handler.pkgs", "jcifs");
       new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write");
       // and so on

Also you can visit the FAQ page: http://jcifs.samba.org/src/docs/faq.html
}}}

I also brute-forced it: I copied jcifs.jar into both jre/lib/ext/ and nutch/lib/, and added -Djava.protocol.handler.pkgs=jcifs to the command that launches Nutch. The warning still appeared, so every seed URL was skipped and the crawl had no entry point.

Following http://jcifs.samba.org/src/docs/faq.html, I then wrote the program below to test the jcifs library [http://jcifs.samba.org/src/jcifs-1.3.14.jar] on its own:
{{{
#!java
import java.net.MalformedURLException;

import jcifs.smb.NtlmPasswordAuthentication;
import jcifs.smb.SmbException;
import jcifs.smb.SmbFile;

public class SmbListTest {

    /**
     * Lists the entries of an SMB share to check that jcifs works by itself.
     *
     * @throws MalformedURLException if the SMB URL cannot be parsed
     * @throws SmbException if the share cannot be listed
     */
    public static void main(String[] args) throws MalformedURLException, SmbException {
        String domain = "WORKSTATION";
        String username = "waue";
        String password = "cccccc";
        String server = "140.110.138.179";
        String share = "share";
        String directory = ".";

        // Credentials for an authenticated connection; unused in the
        // anonymous variant below.
        NtlmPasswordAuthentication auth =
                new NtlmPasswordAuthentication(domain, username, password);
        String smburl = String.format("smb://%s/%s/%s/", server, share, directory);

        // Authenticated variant:
        // SmbFile file = new SmbFile(smburl, auth);
        SmbFile file = new SmbFile(smburl);

        SmbFile[] files = file.listFiles();
        System.err.println("file : ");
        for (SmbFile fi : files) {
            System.err.println(fi.getName());
        }
    }
}
}}}

Output:
{{{
file :
【影片】/
人月神話.pdf
其他/
【音樂】/
test.txt
【軟體】/
【照片】/
【遊戲】/
}}}

This shows that jcifs itself works on my machine, so the problem lies in the integration between protocol-smb and Nutch.

= Conclusion =
 * protocol-smb has not yet been successfully integrated with Nutch.
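
As a side note on the `unknown protocol: smb` warning: `java.net.URL` refuses to parse a URL whose scheme has no registered stream handler, and the `java.protocol.handler.pkgs=jcifs` setting is meant to let the JVM find `jcifs.smb.Handler` for the `smb` scheme. The sketch below reproduces that behaviour with the JDK alone, assuming a plain JVM without jcifs on the classpath. The class name `SmbUrlCheck` and the stub handler are hypothetical and exist only for this illustration; they are not part of jcifs, protocol-smb or Nutch.

{{{
#!java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

// Hypothetical helper, for illustration only.
class SmbUrlCheck {

    // Returns true if java.net.URL can parse the spec with the handlers
    // currently known to the JVM.
    static boolean parses(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    // Parses the spec with an explicitly supplied stub handler, so no
    // handler lookup is needed. The stub cannot open connections.
    static URL parseWithHandler(String spec) {
        URLStreamHandler stub = new URLStreamHandler() {
            @Override
            protected URLConnection openConnection(URL u) throws IOException {
                throw new IOException("stub handler: no real SMB support");
            }
        };
        try {
            return new URL(null, spec, stub);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String spec = "smb://server/share";
        System.out.println("parsed without handler: " + parses(spec));
        URL u = parseWithHandler(spec);
        System.out.println("parsed with stub handler: "
                + u.getProtocol() + " " + u.getHost() + " " + u.getPath());
    }
}
}}}

Running the same check inside the JVM that actually executes the Nutch injector would be one way to see whether the jcifs handler is visible there; that is a suggestion, not something verified on this page.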