Apache Nutch README

Important note: Due to licensing issues, we cannot provide two libraries
(jai_core.jar, jai_codec.jar) that are normally distributed with PDFBox, the
library we use for parsing PDF files. If you encounter unexpected problems when
working with PDF files, please:

1. Download the two missing libraries from:
   http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/external/

2. Put them in the directory src/plugin/parse-pdf/lib
3. Follow the instructions in the file src/plugin/parse-pdf/plugin.xml
4. Rebuild Nutch.

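Before moving on to steps 3 and 4, it can help to confirm that both libraries
are in the right directory. The following is only a minimal sketch, not part of
Nutch: the helper name missing_pdf_jars and the nutch_home argument are
hypothetical, while the JAR names and the lib directory are the ones given in
the steps above.

```python
from pathlib import Path

# JAR names and target directory are taken from the note and step 2 above.
REQUIRED_JARS = ("jai_core.jar", "jai_codec.jar")
PLUGIN_LIB_DIR = Path("src/plugin/parse-pdf/lib")


def missing_pdf_jars(nutch_home):
    """Return the required JAR names that are not present as files in
    <nutch_home>/src/plugin/parse-pdf/lib (hypothetical helper)."""
    lib_dir = Path(nutch_home) / PLUGIN_LIB_DIR
    return [name for name in REQUIRED_JARS if not (lib_dir / name).is_file()]
```

Calling missing_pdf_jars(".") from the top of a Nutch source tree returns an
empty list once both files are in place. It only checks that the files exist;
it does not validate their contents or perform steps 3 and 4.
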
Interesting files include:

docs/api/index.html
  Javadocs for the Nutch software.

CHANGES.txt
  Log of changes to Nutch.

For the latest information about Nutch, please visit our website at:

  http://lucene.apache.org/nutch/

and our wiki, at:

  http://wiki.apache.org/nutch/

To get started using Nutch, read the tutorial:

  http://lucene.apache.org/nutch/tutorial.html

Export Control

This distribution includes cryptographic software. The country in which you
currently reside may have restrictions on the import, possession, use, and/or
re-export to another country, of encryption software. BEFORE using any encryption
software, please check your country's laws, regulations and policies concerning the
import, possession, or use, and re-export of encryption software, to see if this is
permitted. See <http://www.wassenaar.org/> for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has
classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which
includes information security software using or performing cryptographic functions with
asymmetric algorithms. The form and manner of this Apache Software Foundation
distribution makes it eligible for export under the License Exception ENC Technology
Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations,
Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Apache Nutch uses the PDFBox API in its parse-pdf plugin for extracting textual content
and metadata from encrypted PDF files. See http://incubator.apache.org/pdfbox/ for more
details on PDFBox.