Apache Nutch README

Important note: Due to licensing issues, we cannot provide two libraries
(jai_core.jar and jai_codec.jar) that are normally provided with PDFBox, the
library we use for parsing PDF files. If you encounter unexpected problems
when working with PDF files, please:

 1. Download the two missing libraries from:
    http://pdfbox.cvs.sourceforge.net/viewvc/pdfbox/pdfbox/external/

 2. Put them in the directory src/plugin/parse-pdf/lib
 3. Follow the instructions in the file src/plugin/parse-pdf/plugin.xml
 4. Rebuild Nutch.

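The copy step above could be scripted roughly as follows. This is a minimal
sketch and not part of the Nutch distribution: the function name
install_jai_jars and both of its arguments are hypothetical, and it assumes the
two jars have already been downloaded by hand into a local directory.

```shell
# Hypothetical helper: copy the two JAI jars into the parse-pdf plugin's
# lib directory.
# Usage: install_jai_jars <dir-with-downloaded-jars> <nutch-source-root>
install_jai_jars() {
    src_dir="$1"
    nutch_home="$2"
    lib_dir="$nutch_home/src/plugin/parse-pdf/lib"

    # Refuse to continue unless both jars named in the note are present.
    for jar in jai_core.jar jai_codec.jar; do
        if [ ! -f "$src_dir/$jar" ]; then
            echo "missing $src_dir/$jar" >&2
            return 1
        fi
    done

    # Create the lib directory if needed, then copy both jars into it.
    mkdir -p "$lib_dir" || return 1
    cp "$src_dir/jai_core.jar" "$src_dir/jai_codec.jar" "$lib_dir/"
}

# Example call (both paths are placeholders):
#   install_jai_jars "$HOME/Downloads" /path/to/nutch
```

The helper covers step 2 only. Steps 3 and 4 remain manual: edit
src/plugin/parse-pdf/plugin.xml as that file instructs, then rebuild (assuming
the usual Ant build, by running "ant" from the top of the source tree).
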
Interesting files include:

  docs/api/index.html
    Javadocs for the Nutch software.

  CHANGES.txt
    Log of changes to Nutch.

For the latest information about Nutch, please visit our website at:

   http://lucene.apache.org/nutch/

and our wiki at:

   http://wiki.apache.org/nutch/

To get started using Nutch, read the tutorial:

   http://lucene.apache.org/nutch/tutorial.html

Export Control

This distribution includes cryptographic software. The country in which you
currently reside may have restrictions on the import, possession, use, and/or
re-export to another country, of encryption software. BEFORE using any encryption
software, please check your country's laws, regulations and policies concerning the
import, possession, or use, and re-export of encryption software, to see if this is
permitted. See <http://www.wassenaar.org/> for more information.

The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has
classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which
includes information security software using or performing cryptographic functions with
asymmetric algorithms. The form and manner of this Apache Software Foundation
distribution makes it eligible for export under the License Exception ENC Technology
Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations,
Section 740.13) for both object code and source code.

The following provides more details on the included cryptographic software:

Apache Nutch uses the PDFBox API in its parse-pdf plugin for extracting textual content
and metadata from encrypted PDF files. See http://incubator.apache.org/pdfbox/ for more
details on PDFBox.