| 1 | Nutch Change Log |
---|
| 2 | |
---|
| 3 | Release 1.0 - 2009-03-23 |
---|
| 4 | |
---|
| 5 | 1. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab) |
---|
| 6 | |
---|
| 7 | 2. NUTCH-443 - Allow parsers to return multiple Parse objects. |
---|
| 8 | (Dogacan Guney et al, via ab) |
---|
| 9 | |
---|
| 10 | 3. NUTCH-393 - Indexer should handle null documents returned by filters. |
---|
| 11 | (Eelco Lempsink via ab) |
---|
| 12 | |
---|
| 13 | 4. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren) |
---|
| 14 | |
---|
| 15 | 5. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other |
---|
| 16 | bots in robots.txt (Dogacan Guney via siren) |
---|
| 17 | |
---|
| 18 | 6. NUTCH-482 - Remove redundant plugin lib-log4j (siren) |
---|
| 19 | |
---|
| 20 | 7. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin |
---|
| 21 | (siren) |
---|
| 22 | |
---|
| 23 | 8. NUTCH-161 - Change Plain text parser to |
---|
| 24 | use parser.character.encoding.default property for fall back encoding |
---|
| 25 | (KuroSaka TeruHiko, siren) |
---|
| 26 | |
---|
| 27 | 9. NUTCH-61 - Support for adaptive re-fetch interval and detection of |
---|
| 28 | unmodified content. (ab) |
---|
| 29 | |
---|
| 30 | 10. NUTCH-392 - OutputFormat implementations should pass on Progressable. |
---|
| 31 | (cutting via ab) |
---|
| 32 | |
---|
| 33 | 11. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan) |
---|
| 34 | |
---|
| 35 | 12. NUTCH-443 - Allow parsers to return multiple Parse objects; this will speed |
---|
| 36 | up the rss parser (dogacan via mattmann). This update is a fix and semantics |
---|
| 37 | change from the original patch for NUTCH-443. The original patch did not tell |
---|
| 38 | the Indexer to read crawl_parse too so that it can pick up sub-urls' fetch |
---|
| 39 | datums. This patch addresses that issue. Now, if Fetcher gets a null content, |
---|
| 40 | instead of pushing an empty content, it filters the null content. |
---|
| 41 | |
---|
| 42 | 13. NUTCH-485 - Change HtmlParseFilters to return ParseResult object instead of |
---|
| 43 | Parse object. (Gal Nitzan via dogacan) |
---|
| 44 | |
---|
| 45 | 14. NUTCH-489 - URLFilter-suffix management of the url path when the url contains |
---|
| 46 | some query parameters. (Emmanuel Joke via dogacan) |
---|
| 47 | |
---|
| 48 | 15. NUTCH-502 - Bug in SegmentReader causes infinite loop. |
---|
| 49 | (Ilya Vishnevsky via dogacan) |
---|
| 50 | |
---|
| 51 | 16. NUTCH-444 - Possibly use a different library to parse RSS feed for improved |
---|
| 52 | performance and compatibility. This patch introduced a new plugin, feed, |
---|
| 53 | that includes an index filter and a parse plugin for feeds that uses ROME. |
---|
| 54 | There was discussion about removing parse-rss in light of the feed plugin; |
---|
| 55 | however, this patch does not explicitly remove parse-rss. (dogacan, mattmann) |
---|
| 56 | |
---|
| 57 | 17. NUTCH-471 - Fix synchronization in NutchBean creation. |
---|
| 58 | (Enis Soztutar via dogacan) |
---|
| 59 | |
---|
| 60 | 18. Upgrade to Lucene 2.2.0 and Hadoop 0.12.3. (ab) |
---|
| 61 | |
---|
| 62 | 19. NUTCH-468 - Scoring filter should distribute score to all outlinks at |
---|
| 63 | once. (dogacan) |
---|
| 64 | |
---|
| 65 | 20. NUTCH-504 - NUTCH-443 broke parsing during fetching. (dogacan) |
---|
| 66 | |
---|
| 67 | 21. NUTCH-497 - Extreme Nested Tags causes StackOverflowException in |
---|
| 68 | DomContentUtils...Spider Trap. (kubes) |
---|
| 69 | |
---|
| 70 | 22. NUTCH-434 - Replace usage of ObjectWritable with something based on |
---|
| 71 | GenericWritable. (dogacan) |
---|
| 72 | |
---|
| 73 | 23. NUTCH-499 - Refactor LinkDb and LinkDbMerger to reuse code. (dogacan) |
---|
| 74 | |
---|
| 75 | 24. NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation. |
---|
| 76 | (Espen Amble Kolstad via dogacan) |
---|
| 77 | |
---|
| 78 | 25. NUTCH-507 - lib-lucene-analyzers jar definition is wrong in plugin.xml. |
---|
| 79 | (Emmanuel Joke via dogacan) |
---|
| 80 | |
---|
| 81 | 26. NUTCH-503 - Generator exits incorrectly for small fetchlists. |
---|
| 82 | (Vishal Shah via dogacan) |
---|
| 83 | |
---|
| 84 | 27. NUTCH-505 - Outlink urls should be validated. (dogacan) |
---|
| 85 | |
---|
| 86 | 28. NUTCH-510 - IndexMerger delete working dir. (Enis Soztutar via dogacan) |
---|
| 87 | |
---|
| 88 | 29. NUTCH-513 - suffix-urlfilter.txt does not have a template. (dogacan) |
---|
| 89 | |
---|
| 90 | 30. NUTCH-515 - Next fetch time is set incorrectly. (dogacan) |
---|
| 91 | |
---|
| 92 | 30. NUTCH-506 - Nutch should delegate compression to Hadoop. (dogacan) |
---|
| 93 | |
---|
| 94 | 31. NUTCH-517 - build encoding should be UTF-8. (Enis Soztutar via dogacan). |
---|
| 95 | |
---|
| 96 | 32. NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining. |
---|
| 97 | (Enis Soztutar via dogacan) |
---|
| 98 | |
---|
| 99 | 33. NUTCH-516 - Next fetch time is not set when it is a |
---|
| 100 | CrawlDatum.STATUS_FETCH_GONE. (Emmanuel Joke via dogacan) |
---|
| 101 | |
---|
| 102 | 34. NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException |
---|
| 103 | when trying to rerun dedup on a segment. (Vishal Shah via dogacan) |
---|
| 104 | |
---|
| 105 | 35. NUTCH-514 - Indexer should only index pages with fetch status SUCCESS. |
---|
| 106 | (dogacan) Note: There is a bigger problem, i.e. how to deal |
---|
| 107 | with redirected pages, and this issue can be considered as a band-aid |
---|
| 108 | for the time being. See NUTCH-273 and NUTCH-353 for more details. |
---|
| 109 | |
---|
| 110 | 36. NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and |
---|
| 111 | inlinks list. (Emmanuel Joke via dogacan) |
---|
| 112 | |
---|
| 113 | 37. NUTCH-535 - ParseData's contentMeta accumulates unnecessary values during |
---|
| 114 | parse. (dogacan) |
---|
| 115 | |
---|
| 116 | 38. NUTCH-522 - Use URLValidator in the Injector. (Emmanuel Joke, dogacan) |
---|
| 117 | |
---|
| 118 | 39. NUTCH-536 - Reduce number of warnings in nutch core. (dogacan) |
---|
| 119 | |
---|
| 120 | 40. NUTCH-439 - Top Level Domains Indexing / Scoring. Also adds |
---|
| 121 | domain-related utilities. (Enis Soztutar via dogacan) |
---|
| 122 | |
---|
| 123 | 41. NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable |
---|
| 124 | release (2.1). (Dawid Weiss via dogacan) |
---|
| 125 | |
---|
| 126 | 42. NUTCH-545 - Configuration and OnlineClusterer get initialized in every |
---|
| 127 | request. (Dawid Weiss via dogacan) |
---|
| 128 | |
---|
| 129 | 43. NUTCH-532 - CrawlDbMerger: wrong computation of last fetch time. |
---|
| 130 | (Emmanuel Joke via dogacan) |
---|
| 131 | |
---|
| 132 | 44. NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. (dogacan) |
---|
| 133 | |
---|
| 134 | 45. NUTCH-546 - file URLs are filtered out by the crawler. (dogacan) |
---|
| 135 | |
---|
| 136 | 46. NUTCH-554 - Generator throws IOException on invalid urls. |
---|
| 137 | (Brian Whitman via ab) |
---|
| 138 | |
---|
| 139 | 47. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child. |
---|
| 140 | (Emmanuel Joke via dogacan) |
---|
| 141 | |
---|
| 142 | 48. NUTCH-25 - needs 'character encoding' detector. |
---|
| 143 | (Doug Cook, dogacan, Marcin Okraszewski, Renaud Richardet via dogacan) |
---|
| 144 | |
---|
| 145 | 49. NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated |
---|
| 146 | to the tasktracker. (Mathijs Homminga, Emmanuel Joke via dogacan) |
---|
| 147 | |
---|
| 148 | 50. NUTCH-562 - Port mime type framework to use Tika mime detection framework. |
---|
| 149 | (mattmann) |
---|
| 150 | |
---|
| 151 | 51. NUTCH-488 - Avoid parsing unnecessary links and get a more relevant outlink |
---|
| 152 | list. (Emmanuel Joke, Marcin Okraszewski via kubes) |
---|
| 153 | |
---|
| 154 | 52. NUTCH-501 - Implement a different caching mechanism for objects cached in |
---|
| 155 | configuration. (dogacan) |
---|
| 156 | |
---|
| 157 | 53. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes) |
---|
| 158 | |
---|
| 159 | 54. NUTCH-565 - Arc File to Nutch Segments Converter. (kubes) |
---|
| 160 | |
---|
| 161 | 55. NUTCH-547 - Redirection handling: YahooSlurp's algorithm. |
---|
| 162 | (dogacan, kubes via dogacan) |
---|
| 163 | |
---|
| 164 | 56. NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat. |
---|
| 165 | (Emmanuel Joke via dogacan) |
---|
| 166 | |
---|
| 167 | 57. NUTCH-538 - Delete unused classes under o.a.n.util. (dogacan) |
---|
| 168 | |
---|
| 169 | 58. NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. (dogacan) |
---|
| 170 | |
---|
| 171 | 59. NUTCH-574 - Including inlink anchor text in index can create irrelevant |
---|
| 172 | search results. Created index-anchor plugin, removed functionality from |
---|
| 173 | index-basic plugin. For backwards compatibility, add index-anchor plugin to |
---|
| 174 | nutch-site.xml plugin.includes. (kubes) |
---|
| 175 | |
---|
| 176 | 60. NUTCH-581 - DistributedSearch does not update search servers added to |
---|
| 177 | search-servers.txt on the fly. (Rohan Mehta via kubes) |
---|
| 178 | |
---|
| 179 | 61. NUTCH-586 - Add option to run compiled classes without job file |
---|
| 180 | (enis via ab) |
---|
| 181 | |
---|
| 182 | 62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy |
---|
| 183 | server. (Susam Pal via dogacan) |
---|
| 184 | |
---|
| 185 | 63. NUTCH-534 - SegmentMerger: add -normalize option (Emmanuel Joke via ab) |
---|
| 186 | |
---|
| 187 | 64. NUTCH-528 - CrawlDbReader: add some new stats + dump into a CSV format |
---|
| 188 | (Emmanuel Joke via ab) |
---|
| 189 | |
---|
| 190 | 65. NUTCH-597 - NPE in Fetcher2 (Remco Verhoef via ab) |
---|
| 191 | |
---|
| 192 | 66. NUTCH-584 - urls missing from fetchlist (Ruslan Ermilov, ab) |
---|
| 193 | |
---|
| 194 | 67. NUTCH-580 - Remove deprecated hadoop api calls (FS) (siren) |
---|
| 195 | |
---|
| 196 | 68. NUTCH-587 - Upgrade to Hadoop 0.15.3 (kubes) |
---|
| 197 | |
---|
| 198 | 69. NUTCH-604 - Upgrade to Lucene 2.3.0 (ab) |
---|
| 199 | |
---|
| 200 | 70. NUTCH-602 - Allow configurable number of handlers for search servers |
---|
| 201 | (hartbecke via kubes) |
---|
| 202 | |
---|
| 203 | 71. NUTCH-607 - Update build.xml to include tika jar when building war (kubes) |
---|
| 204 | |
---|
| 205 | 72. NUTCH-608 - Upgrade nutch to use released apache-tika-0.1-incubating (mattmann) |
---|
| 206 | |
---|
| 207 | 73. NUTCH-606 - Refactoring of Generator, run all urls through checks (kubes) |
---|
| 208 | |
---|
| 209 | 74. NUTCH-605 - Change deprecated configuration methods for Hadoop (kubes) |
---|
| 210 | |
---|
| 211 | 75. NUTCH-603 - Add more default url normalizations (kubes) |
---|
| 212 | |
---|
| 213 | 76. NUTCH-611 - Upgrade Nutch to use Hadoop 0.16 (kubes) |
---|
| 214 | |
---|
| 215 | 77. NUTCH-44 - Too many search results, limits max results returned from a |
---|
| 216 | single search. (Emilijan Mirceski and Susam Pal via kubes) |
---|
| 217 | |
---|
| 218 | 78. NUTCH-567 - Proper (?) handling of URIs in TagSoup. TagSoup library is |
---|
| 219 | updated to 1.2 version. (dogacan) |
---|
| 220 | |
---|
| 221 | 79. NUTCH-613 - Empty summaries and cached pages (kubes via ab) |
---|
| 222 | |
---|
| 223 | 80. NUTCH-612 - URL filtering was disabled in Generator when invoked |
---|
| 224 | from Crawl (Susam Pal via ab) |
---|
| 225 | |
---|
| 226 | 81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab) |
---|
| 227 | |
---|
| 228 | 82. NUTCH-575 - NPE in OpenSearchServlet (John H. Lee via ab) |
---|
| 229 | |
---|
| 230 | 83. NUTCH-126 - Fetching https does not work with a proxy (Fritz Elfert via ab) |
---|
| 231 | |
---|
| 232 | 84. NUTCH-615 - Redirected URL-s fetched without setting fetchInterval. |
---|
| 233 | Guard against reprUrl being null. (Emmanuel Joke, ab) |
---|
| 234 | |
---|
| 235 | 85. NUTCH-616 - Reset Fetch Retry counter when fetch is successful (Emmanuel |
---|
| 236 | Joke, ab) |
---|
| 237 | |
---|
| 238 | 86. NUTCH-220 - Upgrade to PDFBox 0.7.3 (ab) |
---|
| 239 | |
---|
| 240 | 87. NUTCH-223 - Crawl.java uses Integer.MAX_VALUE (Jeff Ritchie via ab) |
---|
| 241 | |
---|
| 242 | 88. NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API. |
---|
| 243 | (Emmanuel Joke, dogacan, ab) |
---|
| 244 | |
---|
| 245 | 89. NUTCH-620 - BasicURLNormalizer should collapse runs of slashes with a |
---|
| 246 | single slash. (Mark DeSpain via ab) |
---|
| 247 | |
---|
| 248 | 90. NUTCH-500 - Add hadoop masters configuration file into conf folder. |
---|
| 249 | (Emmanuel Joke via kubes) |
---|
| 250 | |
---|
| 251 | 91. NUTCH-596 - ParseSegments parse content even if it's not |
---|
| 252 | CrawlDatum.STATUS_FETCH_SUCCESS (dogacan) |
---|
| 253 | |
---|
| 254 | 92. NUTCH-618 - Tika error "Media type alias already exists" (mattmann,kubes) |
---|
| 255 | |
---|
| 256 | 93. NUTCH-634 - Upgrade Nutch to Hadoop 0.17.1 (Michael Gottesman, Lincoln |
---|
| 257 | Ritter, ab) |
---|
| 258 | |
---|
| 259 | 94. NUTCH-641 - IndexSorter incorrectly copies stored fields (ab) |
---|
| 260 | |
---|
| 261 | 95. NUTCH-645 - Parse-swf unit test failing (ab) |
---|
| 262 | |
---|
| 263 | 96. NUTCH-642 - Unit tests fail when run in non-local mode (ab) |
---|
| 264 | |
---|
| 265 | 97. NUTCH-639 - Change LuceneDocumentWrapper visibility from |
---|
| 266 | private to _public_ (Guillaume Smet via dogacan) |
---|
| 267 | |
---|
| 268 | 98. NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn |
---|
| 269 | tracking. (dogacan) |
---|
| 270 | |
---|
| 271 | 99. NUTCH-375 - Add support for Content-Encoding: deflated |
---|
| 272 | (Pascal Beis, ab) |
---|
| 273 | |
---|
| 274 | 100. NUTCH-633 - ParseSegment no longer allows reparsing. |
---|
| 275 | (dogacan) |
---|
| 276 | |
---|
| 277 | 101. NUTCH-653 - Upgrade to hadoop 0.18. (dogacan) |
---|
| 278 | |
---|
| 279 | 102. NUTCH-621 - Nutch needs to declare its crypto usage (mattmann) |
---|
| 280 | |
---|
| 281 | 103. NUTCH-654 - urlfilter-regex's main does not work. |
---|
| 282 | (dogacan) |
---|
| 283 | |
---|
| 284 | 104. NUTCH-640 - confusing description "set it to Integer.MAX_VALUE". |
---|
| 285 | (dogacan) |
---|
| 286 | |
---|
| 287 | 105. NUTCH-662 - Upgrade Nutch to use Lucene 2.4. (kubes) |
---|
| 288 | |
---|
| 289 | 106. NUTCH-663 - Upgrade Nutch to use Hadoop 0.19 (kubes) |
---|
| 290 | |
---|
| 291 | 107. NUTCH-647 - Resolve URLs tool (kubes) |
---|
| 292 | |
---|
| 293 | 108. NUTCH-665 - Search Load Testing Tool (kubes) |
---|
| 294 | |
---|
| 295 | 109. NUTCH-667 - Input Format for working with Content in Hadoop Streaming |
---|
| 296 | (kubes) |
---|
| 297 | |
---|
| 298 | 110. NUTCH-635 - LinkAnalysis Tool for Nutch. (kubes) |
---|
| 299 | |
---|
| 300 | 111. NUTCH-646 - New Indexing Framework for Nutch. (kubes) |
---|
| 301 | |
---|
| 302 | 112. NUTCH-668 - Domain URL Filter. (kubes) |
---|
| 303 | |
---|
| 304 | 113. NUTCH-594 - Serve Nutch search results in multiple formats including |
---|
| 305 | XML and JSON. (kubes) |
---|
| 306 | |
---|
| 307 | 114. NUTCH-442 - Integrate Solr/Nutch. (dogacan, original version by siren) |
---|
| 308 | |
---|
| 309 | 115. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate |
---|
| 310 | fetch interval correctly. (dogacan) |
---|
| 311 | |
---|
| 312 | 116. NUTCH-627 - Minimize host address lookup (Otis Gospodnetic) |
---|
| 313 | |
---|
| 314 | 117. NUTCH-678 - Hadoop 0.19 requires an update of jets3t. |
---|
| 315 | (julien nioche via dogacan) |
---|
| 316 | |
---|
| 317 | 118. NUTCH-681 - parse-mp3 compilation problem. |
---|
| 318 | (Wildan Maulana via dogacan) |
---|
| 319 | |
---|
| 320 | 119. NUTCH-676 - MapWritable is written inefficiently and confusingly. |
---|
| 321 | (dogacan) |
---|
| 322 | |
---|
| 323 | 120. NUTCH-579 - Feed plugin only indexes one post per feed due to identical |
---|
| 324 | digest. (dogacan) |
---|
| 325 | |
---|
| 326 | 121. NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3. |
---|
| 327 | (Joseph Chen, dogacan) |
---|
| 328 | |
---|
| 329 | 122. NUTCH-682 - SOLR indexer does not set boost on the document. |
---|
| 330 | (julien nioche via dogacan) |
---|
| 331 | |
---|
| 332 | 123. NUTCH-279 - Additions to urlnormalizer-regex (Stefan Neufeind, ab) |
---|
| 333 | |
---|
| 334 | 124. NUTCH-671 - JSP errors in Nutch searcher webapp (Edwin Chu via ab) |
---|
| 335 | |
---|
| 336 | 125. NUTCH-643 - ClassCastException in PDF parser (Guillaume Smet, ab) |
---|
| 337 | |
---|
| 338 | 126. NUTCH-636 - Httpclient plugin https doesn't work on IBM JRE |
---|
| 339 | (Curtis d'Entremont, ab) |
---|
| 340 | |
---|
| 341 | 127. NUTCH-683 - NUTCH-676 broke CrawlDbMerger. (dogacan) |
---|
| 342 | |
---|
| 343 | 128. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException |
---|
| 344 | (Stefan Will, siren) |
---|
| 345 | |
---|
| 346 | 129. NUTCH-691 - Update jakarta poi jars to the most relevant version |
---|
| 347 | (Dmitry Lihachev via siren) |
---|
| 348 | |
---|
| 349 | 130. NUTCH-563 - Include custom fields in BasicQueryFilter |
---|
| 350 | (Julien Nioche via siren) |
---|
| 351 | |
---|
| 352 | 131. NUTCH-695 - Incorrect mime type detection by MoreIndexingFilter plugin |
---|
| 353 | (Dmitry Lihachev via siren) |
---|
| 354 | |
---|
| 355 | 132. NUTCH-694 - Distributed Search Server fails (siren) |
---|
| 356 | |
---|
| 357 | 133. NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links |
---|
| 358 | set at cross domain redirects (Remco Verhoef, dogacan via siren) |
---|
| 359 | |
---|
| 360 | 134. NUTCH-247 - Robot parser to restrict (kubes, siren) |
---|
| 361 | |
---|
| 362 | 135. NUTCH-698 - CrawlDb is corrupted after a few crawl cycles (dogacan |
---|
| 363 | via siren) |
---|
| 364 | |
---|
| 365 | 136. NUTCH-699 - Add an "official" solr schema for solr integration (dogacan, |
---|
| 366 | Dmitry Lihachev via siren) |
---|
| 367 | |
---|
| 368 | 137. NUTCH-703 - Upgrade to Hadoop 0.19.1 (ab) |
---|
| 369 | |
---|
| 370 | 138. NUTCH-419 - Unavailable robots.txt kills fetch (Carsten Lehmann, |
---|
| 371 | Doug Cook via ab) |
---|
| 372 | |
---|
| 373 | 139. NUTCH-700 - Neko1.9.11 goes into a loop (Julien Nioche, siren) |
---|
| 374 | |
---|
| 375 | 140. NUTCH-669 - Consolidate code for Fetcher and Fetcher2 (siren) |
---|
| 376 | |
---|
| 377 | 141. NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1 (ab) |
---|
| 378 | |
---|
| 379 | 142. NUTCH-684 - Dedup support for Solr. (dogacan) |
---|
| 380 | |
---|
| 381 | 143. NUTCH-715 - Subcollection plugin doesn't work with default |
---|
| 382 | subcollections.xml file (Dmitry Lihachev via siren) |
---|
| 383 | |
---|
| 384 | 144. NUTCH-722 - Nutch contains JAI jars that we cannot redistribute |
---|
| 385 | |
---|
| 386 | Release 0.9 - 2007-04-02 |
---|
| 387 | |
---|
| 388 | 1. Changed log4j configuration to log to stdout on commandline |
---|
| 389 | tools (siren) |
---|
| 390 | |
---|
| 391 | 2. NUTCH-344 - Fix for thread blocking issue (Greg Kim via siren) |
---|
| 392 | |
---|
| 393 | 3. NUTCH-260 - Update hadoop version to 0.5.0 (Renaud Richardet, |
---|
| 394 | siren) |
---|
| 395 | |
---|
| 396 | 4. Optionally skip pages with abnormally large values of Crawl-Delay |
---|
| 397 | (Dennis Kubes via ab) |
---|
| 398 | |
---|
| 399 | 5. Change readdb -stats to use CombiningCollector (ab) |
---|
| 400 | |
---|
| 401 | 6. NUTCH-348 - Fix Generator to select highest scoring pages (Chris |
---|
| 402 | Schneider and Stefan Groschupf via ab) |
---|
| 403 | |
---|
| 404 | 7. NUTCH-347 - Adjust plugin build script not to emit warnings when copying |
---|
| 405 | dependent jars (siren) |
---|
| 406 | |
---|
| 407 | 8. NUTCH-338 - Remove the text parser as an option for parsing PDF files |
---|
| 408 | in parse-plugins.xml (Chris A. Mattmann via siren) |
---|
| 409 | |
---|
| 410 | 9. NUTCH-105 - Network error during robots.txt fetch causes file to |
---|
| 411 | be ignored (Greg Kim via siren) |
---|
| 412 | |
---|
| 413 | 10. NUTCH-367 - DistributedSearch throws ClassCastException (siren) |
---|
| 414 | |
---|
| 415 | 11. NUTCH-332 - Fix the problem of doubling scores caused by links pointing |
---|
| 416 | to the current page (e.g. anchors). (Stefan Groschupf via ab) |
---|
| 417 | |
---|
| 418 | 12. NUTCH-365 - Flexible URL normalization (ab) |
---|
| 419 | |
---|
| 420 | 13. NUTCH-336 - Differentiate between newly discovered pages and newly |
---|
| 421 | injected pages (Chris Schneider via ab) NOTE: this changes the |
---|
| 422 | scoring API, filter implementations need to be updated. |
---|
| 423 | |
---|
| 424 | 14. NUTCH-337 - Fetcher ignores the fetcher.parse value (Stefan Groschupf |
---|
| 425 | via ab) |
---|
| 426 | |
---|
| 427 | 15. NUTCH-350 - Urls blocked by http.max.delays incorrectly marked as GONE |
---|
| 428 | (Stefan Groschupf via ab) |
---|
| 429 | |
---|
| 430 | 16. NUTCH-374 - When http.content.limit is set to -1 and |
---|
| 431 | Response.CONTENT_ENCODING is gzip or x-gzip, it cannot fetch anything |
---|
| 432 | (King Kong via pkosiorowski) |
---|
| 433 | |
---|
| 434 | 17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab) |
---|
| 435 | |
---|
| 436 | ****************************** WARNING !!! ******************************** |
---|
| 437 | * This upgrade breaks data format compatibility. A tool 'convertdb' * |
---|
| 438 | * was added to migrate existing CrawlDb-s to the new format. Segment data * |
---|
| 439 | * can be partially migrated using 'mergesegs', however segments will * |
---|
| 440 | * require re-parsing (and consequently re-indexing). * |
---|
| 441 | ****************************** WARNING !!! ******************************** |
---|
| 442 | |
---|
| 443 | 18. NUTCH-371 - DeleteDuplicates now correctly implements both parts of |
---|
| 444 | the algorithm. (ab) |
---|
| 445 | |
---|
| 446 | 19. NUTCH-391 - ParseUtil logs file contents to log file when it cannot |
---|
| 447 | find parser (siren) |
---|
| 448 | |
---|
| 449 | 20. NUTCH-379 - ParseUtil does not pass through the content's URL to the |
---|
| 450 | ParserFactory (Chris A. Mattmann via siren) |
---|
| 451 | |
---|
| 452 | 21. NUTCH-361, NUTCH-136 - When jobtracker is 'local' generate only one |
---|
| 453 | partition. (ab) |
---|
| 454 | |
---|
| 455 | 22. NUTCH-399 - Change CommandRunner to use concurrent api from jdk (siren) |
---|
| 456 | |
---|
| 457 | 23. NUTCH-395 - Increase fetching speed (siren) |
---|
| 458 | |
---|
| 459 | 24. NUTCH-388 - nutch-default.xml has outdated example for urlfilter.order |
---|
| 460 | (reported by Jared Dunne) |
---|
| 461 | |
---|
| 462 | 25. NUTCH-404 - Fix LinkDB Usage - implementation mismatch (siren) |
---|
| 463 | |
---|
| 464 | 26. NUTCH-403 - Make URL filtering optional in Generator (siren) |
---|
| 465 | |
---|
| 466 | 27. NUTCH-405 - Content object is not properly initialized in map method |
---|
| 467 | of ParseSegment (siren) |
---|
| 468 | |
---|
| 469 | 28. NUTCH-362 - Remove parse-text from unsupported filetypes in |
---|
| 470 | parse-plugins.xml (siren) |
---|
| 471 | |
---|
| 472 | 29. NUTCH-305 - Update crawl and url filter lists to exclude |
---|
| 473 | jpeg|JPEG|bmp|BMP, suffix-urlfilter.txt (contributed by Stefan |
---|
| 474 | Neufeind) is also updated (siren) |
---|
| 475 | |
---|
| 476 | 30. NUTCH-406 - Metadata tries to write null values (mattmann) |
---|
| 477 | |
---|
| 478 | 31. NUTCH-415 - Generator should mark selected records in CrawlDb. |
---|
| 479 | Due to increased resource consumption this step is optional. |
---|
| 480 | Application-level locking has been added to prevent concurrent |
---|
| 481 | modification of databases. (ab) |
---|
| 482 | |
---|
| 483 | 32. NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is |
---|
| 484 | now possible to correctly update CrawlDb from multiple segments. |
---|
| 485 | Introduce new status codes for temporary and permanent |
---|
| 486 | redirection. (ab) |
---|
| 487 | |
---|
| 488 | 33. NUTCH-322 - Fix Fetcher to store redirected pages and to store |
---|
| 489 | protocol-level status. This also should fix NUTCH-273. (ab) |
---|
| 490 | |
---|
| 491 | 34. Change default Fetcher behavior not to follow redirects immediately. |
---|
| 492 | Instead Fetcher will record redirects as new pages to be added to CrawlDb. |
---|
| 493 | This also partially addresses NUTCH-273. (ab) |
---|
| 494 | |
---|
| 495 | 35. Detect and report when Generator creates 0-sized segments. (ab) |
---|
| 496 | |
---|
| 497 | 36. Fix Injector to preserve already existing CrawlDatum if the seed list |
---|
| 498 | being injected also contains such URL. (ab) |
---|
| 499 | |
---|
| 500 | 37. NUTCH-425, NUTCH-426 - Fix anchors pollution. Continue after |
---|
| 501 | skipping bad URLs. (Michael Stack via ab) |
---|
| 502 | |
---|
| 503 | 38. NUTCH-325 - UrlFilters.java throws NPE in case urlfilter.order contains |
---|
| 504 | Filters that are not in plugin.includes (Stefan Groschupf, siren) |
---|
| 505 | |
---|
| 506 | 39. NUTCH-421 - Allow predetermined running order of indexing filters |
---|
| 507 | (Alan Tanaman, siren) |
---|
| 508 | |
---|
| 509 | 40. When indexing pages with redirection, drop all intermediate pages and |
---|
| 510 | index only the final page. (ab) |
---|
| 511 | |
---|
| 512 | 41. Upgrade to Hadoop 0.10.1. (ab) |
---|
| 513 | |
---|
| 514 | 42. NUTCH-420 - Fix a bug in DeleteDuplicates where results depended on the |
---|
| 515 | order in which IndexDoc-s are processed. (Dogacan Guney via ab) |
---|
| 516 | |
---|
| 517 | 43. NUTCH-428 - NullPointerException thrown when agent name is not |
---|
| 518 | configured properly. Changed to throw RuntimeException instead. |
---|
| 519 | (siren) |
---|
| 520 | |
---|
| 521 | 44. NUTCH-430 - Integer overflow in HashComparator.compare (siren) |
---|
| 522 | |
---|
| 523 | 45. NUTCH-68 - Add a tool to generate arbitrary fetchlists. (ab) |
---|
| 524 | |
---|
| 525 | 46. NUTCH-433 - java.io.EOFException in newer nightlies in mergesegs |
---|
| 526 | or indexing from hadoop.io.DataOutputBuffer (siren) |
---|
| 527 | |
---|
| 528 | 47. NUTCH-339 - Fetcher2: a queue-based fetcher implementation. (ab) |
---|
| 529 | |
---|
| 530 | 48. NUTCH-390 - Javadoc warnings (mattmann) |
---|
| 531 | |
---|
| 532 | 49. NUTCH-449 - Make junit output format configurable. (nigel via cutting) |
---|
| 533 | |
---|
| 534 | 50. NUTCH-432 - Fix a bug where platform name with spaces would break the |
---|
| 535 | bin/nutch script. (Brian Whitman via ab) |
---|
| 536 | |
---|
| 537 | 51. Upgrade to Hadoop 0.11.2 and Lucene 2.1.0 release. (ab) |
---|
| 538 | |
---|
| 539 | 52. NUTCH-167 - Observation of robots "noarchive" directive. (ab) |
---|
| 540 | |
---|
| 541 | 53. NUTCH-384 - Protocol-file plugin does not allow the parse plugins |
---|
| 542 | framework to operate properly (Heiko Dietze via mattmann) |
---|
| 543 | |
---|
| 544 | 54. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan |
---|
| 545 | Groschupf via kubes) |
---|
| 546 | |
---|
| 547 | 55. NUTCH-436 - Incorrect handling of relative paths when the embedded URL |
---|
| 548 | path is empty (kubes) |
---|
| 549 | |
---|
| 550 | 56. Upgrade to Hadoop 0.12.1 release. (ab) |
---|
| 551 | |
---|
| 552 | 57. NUTCH-246 - Incorrect segment size being generated due to time |
---|
| 553 | synchronization issue (Stefan Groschupf via ab) |
---|
| 554 | |
---|
| 555 | 58. Upgrade to Hadoop 0.12.2 release. (ab) |
---|
| 556 | |
---|
| 557 | 59. NUTCH-333 - SegmentMerger and SegmentReader should use NutchJob. (Michael |
---|
| 558 | Stack and Dogacan Guney via kubes) |
---|
| 559 | |
---|
| 560 | Release 0.8 - 2006-07-25 |
---|
| 561 | |
---|
| 562 | 0. Totally new architecture, based on hadoop |
---|
| 563 | [http://lucene.apache.org/hadoop] (cutting) |
---|
| 564 | |
---|
| 565 | 1. NUTCH-107 - Typo in plugin/urlfilter-*/plugin.xml. (Stephen Cross). |
---|
| 566 | |
---|
| 567 | 2. NUTCH-108 - Log hosts that exceed generate.max.per.host. |
---|
| 568 | (Rod Taylor via cutting) |
---|
| 569 | |
---|
| 570 | 3. NUTCH-88 - Enhance ParserFactory plugin selection policy |
---|
| 571 | (jerome) |
---|
| 572 | |
---|
| 573 | 4. NUTCH-124 - Protocol-httpclient does not follow redirects when |
---|
| 574 | fetching robots.txt (cutting) |
---|
| 575 | |
---|
| 576 | 5. NUTCH-130 - Be explicit about target JVM when building (1.4.x?) |
---|
| 577 | (stack@archive.org, cutting) |
---|
| 578 | |
---|
| 579 | 6. NUTCH-114 - Getting number of urls and links from crawldb |
---|
| 580 | (Stefan Groschupf via ab) |
---|
| 581 | |
---|
| 582 | 7. NUTCH-112 - Link in cached.jsp page to cached content is an |
---|
| 583 | absolute link (Chris A. Mattmann via jerome) |
---|
| 584 | |
---|
| 585 | 8. NUTCH-135 - Http header meta data are case insensitive in the |
---|
| 586 | real world (Stefan Groschupf via jerome) |
---|
| 587 | |
---|
| 588 | 9. NUTCH-145 - Build of war file fails on Chinese (zh) .xml files due |
---|
| 589 | to UTF-8 BOM (KuroSaka TeruHiko via siren) |
---|
| 590 | |
---|
| 591 | 10. NUTCH-121 - SegmentReader for mapred (Rod Taylor via ab) |
---|
| 592 | |
---|
| 593 | 11. Added support for OpenSearch (cutting) |
---|
| 594 | |
---|
| 595 | 12. NUTCH-142 - NutchConf should use the thread context classloader |
---|
| 596 | (Mike Cannon-Brookes via pkosiorowski) |
---|
| 597 | |
---|
| 598 | 13. NUTCH-160 - Use standard Java Regex library rather than |
---|
| 599 | org.apache.oro.text.regex (Rod Taylor via cutting) |
---|
| 600 | |
---|
| 601 | 14. NUTCH-151 - CommandRunner can hang after the main thread exec is |
---|
| 602 | finished and has inefficient busy loop (Paul Baclace via cutting) |
---|
| 603 | |
---|
| 604 | 15. NUTCH-174 - Problem encountered with ant during compilation |
---|
| 605 | |
---|
| 606 | 16. NUTCH-190 - ParseUtil drops reason for failed parse |
---|
| 607 | (stack@archive.org via ab) |
---|
| 608 | |
---|
| 609 | 17. NUTCH-169 - Remove static NutchConf (Marko Bauhardt via ab) |
---|
| 610 | |
---|
| 611 | 18. NUTCH-194 - Nutch-169 introduced two tiny bugs (Marko Bauhardt via ab) |
---|
| 612 | |
---|
| 613 | 19. NUTCH-178 - in search.jsp must be session creation "false" |
---|
| 614 | (YourSoft via siren) |
---|
| 615 | |
---|
| 616 | 20. NUTCH-200 - OpenSearch Servlet is broken |
---|
| 617 | (Marko Bauhardt via siren) |
---|
| 618 | |
---|
| 619 | 21. NUTCH-81 - Webapp only works when deployed in root |
---|
| 620 | (AJ Banck, Michael Nebel via siren) |
---|
| 621 | |
---|
| 622 | 22. NUTCH-139 - Standard metadata property names in the ParseData |
---|
| 623 | metadata (Chris A. Mattmann, jerome) |
---|
| 624 | |
---|
| 625 | 23. NUTCH-192 - Meta data support for CrawlDatum |
---|
| 626 | (Stefan Groschupf via ab) |
---|
| 627 | |
---|
| 628 | 24. NUTCH-52 - Parser plugin for MS Excel files |
---|
| 629 | (Rohit Kulkarni via jerome) |
---|
| 630 | |
---|
| 631 | 25. NUTCH-53 - Parser plugin for Zip files |
---|
| 632 | (Rohit Kulkarni via jerome) |
---|
| 633 | |
---|
| 634 | 26. NUTCH-137 - footer is not displayed in search result page |
---|
| 635 | (KuroSaka TeruHiko via siren) |
---|
| 636 | |
---|
| 637 | 27. NUTCH-118 - FAQ link points to invalid URL |
---|
| 638 | (Steve Betts via siren) |
---|
| 639 | |
---|
| 640 | 28. NUTCH-184 - Serbian (sr, Cyrillic) and Serbo-Croatian (sh, Latin) |
---|
| 641 | translation (Ivan Sekulovic via siren) |
---|
| 642 | |
---|
| 643 | 29. NUTCH-211 - FetchedSegments leave readers open (Stefan Groschupf |
---|
| 644 | via cutting) |
---|
| 645 | |
---|
| 646 | 30. NUTCH-140 - Add alias capability in parse-plugins.xml file that |
---|
| 647 | allows mimeType->extensionId mapping (Chris A. Mattmann via jerome) |
---|
| 648 | |
---|
| 649 | 31. NUTCH-214 - Added Links to web site to search mailing list |
---|
| 650 | (Jake Vanderdray via jerome) |
---|
| 651 | |
---|
| 652 | 32. NUTCH-204 - Multiple field values in HitDetails |
---|
| 653 | (Stefan Groschupf via jerome) |
---|
| 654 | |
---|
| 655 | 33. NUTCH-219 - file.content.limit & ftp.content.limit should be changed |
---|
| 656 | to -1 to be consistent with http (jerome) |
---|
| 657 | |
---|
| 658 | 34. NUTCH-221 - Prepare nutch for upcoming lucene 2.0 (siren) |
---|
| 659 | |
---|
| 660 | 35. NUTCH-91 - Empty encoding causes exception (Michael Nebel via |
---|
| 661 | pkosiorowski) |
---|
| 662 | |
---|
| 663 | 36. NUTCH-228 - Clustering plugin descriptor broken (Dawid Weiss via |
---|
| 664 | jerome) |
---|
| 665 | |
---|
| 666 | 37. NUTCH-229 - Improved handling of plugin folder configuration |
---|
| 667 | (Stefan Groschupf via ab) |
---|
| 668 | |
---|
| 669 | 38. NUTCH-206 - Search server throws InstantiationException (ab) |
---|
| 670 | |
---|
| 671 | 39. NUTCH-203 - ParseSegment throws InstantiationException (Marko Bauhardt |
---|
| 672 | via ab) |
---|
| 673 | |
---|
| 674 | 40. NUTCH-3 - Multi values of header discarded (Stefan Groschupf via ab) |
---|
| 675 | |
---|
| 676 | 41. Update to lucene 1.9.1 (cutting) |
---|
| 677 | |
---|
| 678 | 42. NUTCH-235 - Duplicate Inlink values (ab) |
---|
| 679 | |
---|
| 680 | 43. NUTCH-234 - Clustering extension code cleanups and a real |
---|
| 681 | JUnit test case for the current implementation (Dawid Weiss via ab) |
---|
| 682 | |
---|
| 683 | 44. NUTCH-210 - Context.xml file for Nutch web application |
---|
| 684 | (Chris A. Mattmann via jerome) |
---|
| 685 | |
---|
| 686 | 45. NUTCH-231 - Invalid CSS entries (AJ Banck via jerome) |
---|
| 687 | |
---|
| 688 | 46. NUTCH-232 - Search.jsp has multiple search forms creating |
---|
| 689 | invalid html / incorrect focus function (jerome) |
---|
| 690 | |
---|
| 691 | 47. NUTCH-196 - lib-xml and lib-log4j plugins (ab, jerome) |
---|
| 692 | |
---|
| 693 | 48. NUTCH-244 - Inconsistent handling of property values |
---|
| 694 | boundaries / unable to set db.max.outlinks.per.page to |
---|
| 695 | infinite (jerome) |
---|
| 696 | |
---|
| 697 | 49. NUTCH-245 - DTD for plugin.xml configuration files |
---|
| 698 | (Chris A. Mattmann via jerome) |
---|
| 699 | |
---|
| 700 | 50. NUTCH-250 - Generate to log truncation caused by |
---|
| 701 | generate.max.per.host (Rod Taylor via cutting) |
---|
| 702 | |
---|
| 703 | 51. NUTCH-125 - OpenOffice Parser plugin (ab) |
---|
| 704 | |
---|
| 705 | 52. Switch from using java.io.File to org.apache.hadoop.fs.Path. |
---|
| 706 | (cutting) |
---|
| 707 | |
---|
| 708 | 53. NUTCH-240 - Scoring API: extension point, scoring filters and |
---|
| 709 | an OPIC plugin (ab) |
---|
| 710 | |
---|
| 711 | 54. NUTCH-134 - Summarizer doesn't select the best snippets (jerome) |
---|
| 712 | |
---|
| 713 | 55. NUTCH-268 - Generator and lib-http use different definitions of |
---|
| 714 | "unique host" (ab) |
---|
| 715 | |
---|
| 716 | 56. NUTCH-280 - Url query causes NullPointerException (Grant Glouser |
---|
| 717 | via siren) |
---|
| 718 | |
---|
| 719 | 57. NUTCH-285 - LinkDb Fails rename doesn't create parent directories |
---|
| 720 | (Dennis Kubes via ab) |
---|
| 721 | |
---|
| 722 | 58. NUTCH-201 - Add support for subcollections |
---|
| 723 | (siren) |
---|
| 724 | |
---|
| 725 | 59. NUTCH-298 - If a 404 for a robots.txt is returned a NPE is thrown |
---|
| 726 | (Stefan Groschupf via jerome) |
---|
| 727 | |
---|
| 728 | 60. NUTCH-275 - Fetcher not parsing XHTML-pages at all (jerome) |
---|
| 729 | |
---|
| 730 | 61. NUTCH-301 - CommonGrams loads analysis.common.terms.file for each query |
---|
| 731 | (Stefan Groschupf via jerome) |
---|
| 732 | |
---|
| 733 | 62. NUTCH-110 - OpenSearchServlet outputs illegal xml characters |
---|
| 734 | (stack@archive.org via siren) |
---|
| 735 | |
---|
| 736 | 63. NUTCH-292 - OpenSearchServlet: OutOfMemoryError: Java heap space |
---|
| 737 | (Stefan Neufeind via siren) |
---|
| 738 | |
---|
| 739 | 64. NUTCH-307 - Wrong configured log4j.properties (jerome) |
---|
| 740 | |
---|
| 741 | 65. NUTCH-303 - Logging improvements (jerome) |
---|
| 742 | |
---|
| 743 | 66. NUTCH-308 - Maximum search time limit (ab) |
---|
| 744 | |
---|
| 745 | 67. NUTCH-306 - DistributedSearch.Client liveAddresses concurrency |
---|
| 746 | problem (Grant Glouser via siren) |
---|
| 747 | |
---|
| 748 | 68. Update to hadoop-0.4 (Milind Bhandarkar, cutting) |
---|
| 749 | |
---|
| 750 | 69. NUTCH-317 - Clarify what the queryLanguage argument of |
---|
| 751 | Query.parse(...) means (jerome) |
---|
| 752 | |
---|
| 753 | 70. Added alternative experimental web gui in contrib containing |
---|
| 754 | extensions like subcollection, keymatch, user preferences, |
---|
| 755 | caching, implemented mainly using tiles and jstl (siren) |
---|
| 756 | |
---|
| 757 | 71. NUTCH-320 - DmozParser does not output list of urls to stdout |
---|
| 758 | but to a log file instead. Original functionality restored. |
---|
| 759 | |
---|
| 760 | 72. NUTCH-271 - Add ability to limit crawling to the set of initially |
---|
| 761 | injected hosts (db.ignore.external.links) (Philippe Eugene, |
---|
| 762 | Stefan Neufeind via ab) |
---|
| 763 | |
---|
| 764 | 73. NUTCH-293 - Support for Crawl-Delay (Stefan Groschupf via ab) |
---|
| 765 | |
---|
| 766 | 74. NUTCH-327 - Fixed logging directory on cygwin (siren) |
---|
| 767 | |
---|
| 768 | Release 0.7 - 2005-08-17 |
---|
| 769 | |
---|
| 770 | 1. Added support for "type:" in queries. Search results are limited/qualified |
---|
| 771 | by mimetype or its primary type or sub type. For example, |
---|
| 772 | (1) searching with "type:application/pdf" restricts results |
---|
| 773 | to pages which were identified to be of mimetype "application/pdf". |
---|
| 774 | (2) with "type:application", nutch will return pages of |
---|
| 775 | primary type "application". |
---|
| 776 | (3) with "type:pdf", only pages of sub type "pdf" will be listed. |
---|
| 777 | (John Xing, 20050120) |
---|
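The three matching modes described in entry 1 can be sketched roughly as follows. This is only an illustration of the rule, not the Nutch implementation: the class name `TypeQuerySketch` and the method `matchesType` are hypothetical, and the sketch assumes the stored MIME type has the form primary/sub.

```java
import java.util.Locale;

class TypeQuerySketch {
    /**
     * Returns true if the value of a "type:" term equals the full MIME type,
     * its primary type, or its sub type (hypothetical helper, for illustration).
     */
    static boolean matchesType(String term, String mimeType) {
        String t = term.toLowerCase(Locale.ROOT);
        String mime = mimeType.toLowerCase(Locale.ROOT);
        int slash = mime.indexOf('/');
        if (slash < 0) {
            // No primary/sub split available: compare the whole string only.
            return t.equals(mime);
        }
        String primary = mime.substring(0, slash);
        String sub = mime.substring(slash + 1);
        return t.equals(mime) || t.equals(primary) || t.equals(sub);
    }

    public static void main(String[] args) {
        System.out.println(matchesType("application/pdf", "application/pdf")); // true
        System.out.println(matchesType("application", "application/pdf"));     // true
        System.out.println(matchesType("pdf", "application/pdf"));             // true
        System.out.println(matchesType("pdf", "text/html"));                   // false
    }
}
```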
| 778 | |
---|
| 779 | 2. Added support for "date:" in queries. Last-Modified is indexed. |
---|
| 780 | Search results are restricted by lower and upper date (inclusive) |
---|
| 781 | as date:yyyymmdd-yyyymmdd. For example, date:20040101-20041231 |
---|
| 782 | only returns pages with Last-Modified in year 2004. |
---|
| 783 | (John Xing, 20050122) |
---|
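The inclusive range check described in entry 2 can be sketched as below. The sketch assumes the term value (the part after "date:") has already been split off the query; `DateRangeSketch` and `inRange` are hypothetical names, and it uses java.time, which is newer than the code this entry describes.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class DateRangeSketch {
    /**
     * Checks whether lastModified lies within a range written as
     * yyyymmdd-yyyymmdd, with both bounds inclusive.
     */
    static boolean inRange(String range, LocalDate lastModified) {
        String[] bounds = range.split("-");
        if (bounds.length != 2) {
            throw new IllegalArgumentException("expected yyyymmdd-yyyymmdd: " + range);
        }
        // BASIC_ISO_DATE parses dates without separators, e.g. 20040101.
        LocalDate lower = LocalDate.parse(bounds[0], DateTimeFormatter.BASIC_ISO_DATE);
        LocalDate upper = LocalDate.parse(bounds[1], DateTimeFormatter.BASIC_ISO_DATE);
        return !lastModified.isBefore(lower) && !lastModified.isAfter(upper);
    }

    public static void main(String[] args) {
        String year2004 = "20040101-20041231";
        System.out.println(inRange(year2004, LocalDate.of(2004, 12, 31))); // true
        System.out.println(inRange(year2004, LocalDate.of(2005, 1, 1)));   // false
    }
}
```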
| 784 | |
---|
| 785 | 3. Add URLFilter plugin interface and convert existing url filters into |
---|
| 786 | plugins. (John Xing, 20050206) |
---|
| 787 | |
---|
| 788 | 4. Add UpdateSegmentsFromDb tool, which updates the scores and |
---|
| 789 | anchors of existing segments with the current values in the web |
---|
| 790 | db. This is used by CrawlTool, so that pages are now only fetched |
---|
| 791 | once per crawl. (Doug Cutting, 20050221) |
---|
| 792 | |
---|
| 793 | 5. Moved code into org.apache.nutch sub-packages. Changed license to |
---|
| 794 | Apache 2.0. Removed jar files whose licenses do not permit |
---|
| 795 | redistribution by Apache. Disabled compilation of plugins which |
---|
| 796 | require these libraries. (Doug Cutting 20050301) |
---|
| 797 | |
---|
| 798 | 6. Index host and title in separate fields. Host was indexed |
---|
| 799 | previously only as a part of the URL. Title was indexed as an |
---|
| 800 | anchor. Now boosts for matching these fields may be adjusted |
---|
| 801 | separately from boosts for matching anchors and url. Also: move |
---|
| 802 | site indexing to index-basic plugin to minimize the number of |
---|
| 803 | times the URL needs to be parsed; and, stop using anchor analyzer |
---|
| 804 | for anything but anchors. (Piotr Kosiorowski via Doug Cutting |
---|
| 805 | 20050323) |
---|
| 806 | |
---|
| 807 | 7. Add servlet Cached.java that serves cached Content of any mime type. |
---|
| 808 | web.xml and cached.jsp are slightly modified. |
---|
| 809 | (John Xing, 20050401) |
---|
| 810 | |
---|
| 811 | 8. Add skipCompressedByteArray() to WritableUtils.java. |
---|
| 812 | (John Xing, 20050402) |
---|
| 813 | |
---|
| 814 | 9. Fixes to jsp and static web pages. These now use relative links, |
---|
| 815 | so that the Nutch webapp file can be used in places other than at |
---|
| 816 | the root. Also fixed links to the about and help pages. Bug #32. |
---|
| 817 | (Jerome Charron via cutting, 20050404) |
---|
| 818 | |
---|
| 819 | 10. Added some features to DistributedSearch: new segments can be added |
---|
| 820 | to searchservers without restarting the frontend, defective search |
---|
| 821 | servers are not queried until they come back online, and a watchdog keeps |
---|
| 822 | an eye on your searchservers and writes simple statistics. |
---|
| 823 | (Sami Siren, 20050407) |
---|
| 824 | |
---|
| 825 | 11. Fix for bug #4 - Unbalanced quote in query eats all resources. |
---|
| 826 | (Piotr Kosiorowski, Sami Siren, 20050407) |
---|
| 827 | |
---|
| 828 | 12. Close Issue #33 - MIME content type detector (using magic char sequences). |
---|
| 829 | (Jerome Charron and Hari Kodungallur via John Xing, 20050416) |
---|
| 830 | |
---|
| 831 | 13. Add a servlet that implements A9's OpenSearch RSS web service. |
---|
| 832 | (cutting, 20050418) |
---|
| 833 | |
---|
| 834 | 14. Remove references to link analysis from tutorial, and enable |
---|
| 835 | scoring by link count when generating fetchlists and searching. |
---|
| 836 | (cutting, 20040419) |
---|
| 837 | |
---|
| 838 | 15. Make query boosts for host, title, anchor and phrase matches |
---|
| 839 | configurable. (Piotr Kosiorowski via cutting, 20050419) |
---|
| 840 | |
---|
| 841 | 16. Add support for sorting search results and search-time deduping by |
---|
| 842 | fields other than site. |
---|
| 843 | |
---|
| 844 | 17. Automatically convert range queries into cached range filters. |
---|
| 845 | This improves the performance and scalability of, e.g., date range |
---|
| 846 | searching. |
---|
| 847 | |
---|
| 848 | 18. Several methods have been renamed due to misspellings. The old |
---|
| 849 | methods have been deprecated and will be removed before the 1.0 |
---|
| 850 | release. |
---|
| 851 | |
---|
| 852 | |
---|
| 853 | Release 0.6 |
---|
| 854 | |
---|
| 855 | 1. Added clustering-carrot2 plugin, together with introduction of clustering |
---|
| 856 | api and modification to search jsp. (Dawid Weiss via John Xing, 20040809) |
---|
| 857 | |
---|
| 858 | 2. Make a number of changes to NDFS (Nutch Distributed File System) |
---|
| 859 | to fix bugs, add admin tools, etc. |
---|
| 860 | |
---|
| 861 | Also, modify all command line tools so you can indicate whether to |
---|
| 862 | use NDFS or the local filesystem. If you indicate nothing, then |
---|
| 863 | it defaults to the local fs. |
---|
| 864 | |
---|
| 865 | I've used this to do a 35m page crawl via NDFS, distributed over a |
---|
| 866 | dozen machines. (Mike Cafarella) |
---|
| 867 | |
---|
| 868 | 3. Add support for BASE tags in HTML. Outlinks are now correctly |
---|
| 869 | extracted when a BASE tag is present. (cutting) |
---|
| 870 | |
---|
| 871 | 4. Fix two bugs in result pagination. When the last hit on a page |
---|
| 872 | was the last hit overall, the "next" button was sometimes shown |
---|
| 873 | when the "show all" button should be shown instead. Also, in |
---|
| 874 | certain cases, the "show all" button would be shown when the |
---|
| 875 | "next" button should have been shown. (cutting) |
---|
| 876 | |
---|
| 877 | 5. Add config parameter "indexer.max.tokens" that determines the |
---|
| 878 | maximum number of tokens indexed per field. (Andy Hedges via cutting) |
---|
| 879 | |
---|
| 880 | 6. Add parser for mp3 files. (Andy Hedges via cutting) |
---|
| 881 | |
---|
| 882 | 7. Add RegexUrlNormalizer. This is useful for things like stripping |
---|
| 883 | out session IDs from URLs. To use it, add values for |
---|
| 884 | urlnormalizer.class and urlnormalizer.regex.file to your |
---|
| 885 | nutch-site.xml. The RegexUrlNormalizer class extends the |
---|
| 886 | BasicUrlNormalizer, and does basic normalization as well. |
---|
| 887 | (Luke Baker via cutting) |
---|
| 888 | |
---|
| 889 | 8. Added Swedish translation (Stefan Verzel via Sami Siren, 20040910) |
---|
| 890 | |
---|
| 891 | 9. Added Polish translation (Andrzej Bialecki, 20040911) |
---|
| 892 | |
---|
| 893 | 10. Added 3 more language profiles to language identifier (ru,hu,pl). |
---|
| 894 | Other changes to language identifier: Profiles converted to utf8, |
---|
| 895 | added some test cases, changed the similarity calculation. |
---|
| 896 | (Sami Siren, 20040925) |
---|
| 897 | |
---|
| 898 | 11. Added plugin parse-rtf (Andy Hedges via John Xing, 20040929) |
---|
| 899 | |
---|
| 900 | 12. Added plugin index-more and more.jsp (John Xing, 20041003) |
---|
| 901 | |
---|
| 902 | 13. Added "View as Plain Text" feature. A new op OP_PARSETEXT is introduced |
---|
| 903 | in DistributedSearch.java. text.jsp is added. (John Xing, 20041006) |
---|
| 904 | |
---|
| 905 | 14. Fixed a bug that caused cached.jsp, explain.jsp, anchors.jsp and text.jsp |
---|
| 906 | (but not search.jsp) to fail with NullPointerException in distributed search. |
---|
| 907 | The bug seems to have appeared after the "hits per site" code was added. |
---|
| 908 | The fix is in Hit.java, which now makes sure String site is never null. |
---|
| 909 | Hopefully this fix has no bad effect on the "hits per site" code. |
---|
| 910 | (John Xing, 20041006) |
---|
| 911 | |
---|
| 912 | 15. Fixed a bug that caused fullyDelete() in FileUtil.java to fail for |
---|
| 913 | LocalFileSystem.java. This bug also exposes possible incompleteness |
---|
| 914 | of NDFSFile.java, where a few methods are not supported, including |
---|
| 915 | delete(). Nothing changed in NDFSFile.java though. Leave it for future |
---|
| 916 | improvement (John Xing, 20041022). |
---|
| 917 | |
---|
| 918 | 16. Introduced option -noParsing to Fetcher.java and added ParseSegment.java. |
---|
| 919 | A new status code CANT_PARSE is added to FetcherOutput.java. |
---|
| 920 | Without option -noParsing, fetcher behavior is unchanged. With |
---|
| 921 | option -noParsing, the fetcher only crawls; no parsing is carried out. |
---|
| 922 | ParseSegment.java should then be used to parse in a separate pass. |
---|
| 923 | (John Xing, 20041025) |
---|
| 924 | |
---|
| 925 | 17. Added ontology plugin. Currently it is used for query refinement, as |
---|
| 926 | exemplified in refine-query-init.jsp and refine-query.jsp. By default, |
---|
| 927 | query refinement is disabled in search.jsp. Please check |
---|
| 928 | ./src/plugin/ontology/README.txt for further description. |
---|
| 929 | Ontology plugin certainly can be used for many other things. |
---|
| 930 | (Michael J. Pan via John Xing, 20041129) |
---|
| 931 | |
---|
| 932 | 18. Changed fetcher.server.delay to be a float, so that sub-second |
---|
| 933 | delays can be specified. (cutting) |
---|
| 934 | |
---|
| 935 | 19. Added plugin.includes config parameter that determines which |
---|
| 936 | plugins are included. By default now only http, html and basic |
---|
| 937 | indexing and search plugins are enabled, rather than all plugins. |
---|
| 938 | This should make default performance more predictable and reliable |
---|
| 939 | going forward. (cutting) |
---|
| 940 | |
---|
| 941 | 20. Cleaned up some filesystem code, including: |
---|
| 942 | |
---|
| 943 | - Replaced BufferedRandomAccessFile with two simpler utilities, |
---|
| 944 | NFSDataInputStream and NFSDataOutputStream. |
---|
| 945 | |
---|
| 946 | - Fixed the bug where SequenceFiles were no longer flushed when |
---|
| 947 | created, so that, when fetches crashed, segments were |
---|
| 948 | unreadable. Now segments are always readable after crashes. |
---|
| 949 | Only the contents of the last buffer are lost. |
---|
| 950 | |
---|
| 951 | - Simplified the FSOutputStream API to not include seek(). We |
---|
| 952 | should never need that functionality. |
---|
| 953 | |
---|
| 954 | - Simplified LocalFileSystem's implementations of FSInputStream |
---|
| 955 | and FSOutputStream and optimized FSInputStream.seek(). |
---|
| 956 | |
---|
| 957 | (cutting) |
---|
| 958 | |
---|
| 959 | 21. Fixed BasicUrlNormalizer to better handle relative urls. The file |
---|
| 960 | part of a URL is normalized in the following manner: |
---|
| 961 | |
---|
| 962 | 1. "/aa/../" will be replaced by "/". This is done step by step until |
---|
| 963 | the url doesn't change anymore. This ensures that |
---|
| 964 | "/aa/bb/../../" will be replaced by "/", too. |
---|
| 965 | |
---|
| 966 | 2. leading "/../" will be replaced by "/" |
---|
| 967 | |
---|
| 968 | (Sven Wende via cutting) |
---|
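The two rules in entry 21 can be sketched as below. This is a minimal illustration of the described behavior, not the actual BasicUrlNormalizer code: the class `PathNormalizerSketch`, the method `normalizePath` and both regular expressions are assumptions made for this example, and the input is assumed to be only the file (path) part of a URL.

```java
import java.util.regex.Pattern;

class PathNormalizerSketch {
    // Rule 1: "/segment/../" where the segment is not itself "..".
    private static final Pattern PARENT =
        Pattern.compile("/(?!\\.\\./)[^/]+/\\.\\./");
    // Rule 2: a leading "/../".
    private static final Pattern LEADING = Pattern.compile("^/\\.\\./");

    static String normalizePath(String path) {
        String previous;
        // Apply rule 1 one step at a time until the path stops changing.
        do {
            previous = path;
            path = PARENT.matcher(path).replaceFirst("/");
        } while (!path.equals(previous));
        // Apply rule 2 until no leading "/../" remains.
        do {
            previous = path;
            path = LEADING.matcher(path).replaceFirst("/");
        } while (!path.equals(previous));
        return path;
    }

    public static void main(String[] args) {
        System.out.println(normalizePath("/aa/bb/../../index.html")); // /index.html
        System.out.println(normalizePath("/../index.html"));          // /index.html
    }
}
```

Replacing one match per pass mirrors the "step by step" wording of the entry: "/aa/bb/../../" first becomes "/aa/../" and then "/".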
| 969 | |
---|
| 970 | 22. Fix Page constructors so that next fetch date is less likely to be |
---|
| 971 | misconstrued as a float. This patches a problem in WebDBInjector, |
---|
| 972 | where new pages were added to the db with nextScore set to the |
---|
| 973 | intended nextFetch date. This, in turn, confused link analysis. |
---|
| 974 | |
---|
| 975 | 23. In ndfs code, replace addLocalFile(), putToLocalFile() with |
---|
| 976 | copyFromLocalFile(), moveFromLocalFile(), copyToLocalFile() and |
---|
| 977 | moveToLocalFile(). (John Xing, 20041217) |
---|
| 978 | |
---|
| 979 | 24. Added new config parameter fetcher.threads.per.host. This is used |
---|
| 980 | by the Http protocol. When this is one, behavior is as before. |
---|
| 981 | When this is greater than one then multiple threads are permitted |
---|
| 982 | to access a host at once. Note that fetcher.server.delay is no |
---|
| 983 | longer consistently observed when this is greater than one. |
---|
| 984 | (Luke Baker via Doug Cutting) |
---|
| 985 | |
---|
| 986 | Release 0.5 |
---|
| 987 | |
---|
| 988 | 1. Changed plugin directory to be a list of directories. |
---|
| 989 | |
---|
| 990 | 2. Permit Plugin to be the default plugin implementation. |
---|
| 991 | |
---|
| 992 | 3. Added pluggable interface for network protocols in new package |
---|
| 993 | net.nutch.protocol. Moved http code from core into a plugin. |
---|
| 994 | |
---|
| 995 | 4. Added pluggable interface for content parsing in new package |
---|
| 996 | net.nutch.parse. Moved html parsing code from core into a |
---|
| 997 | plugin. |
---|
| 998 | |
---|
| 999 | 5. Fixed a bug in NutchAnalysis where 16-bit characters were not |
---|
| 1000 | processed correctly. |
---|
| 1001 | |
---|
| 1002 | 6. Fixed bug #971731: random summaries on result page. |
---|
| 1003 | (Daniel Naber via cutting) |
---|
| 1004 | |
---|
| 1005 | 7. Made Nutch logo transparent. (Daniel Naber via cutting) |
---|
| 1006 | |
---|
| 1007 | 8. Added file protocol plugin. (John Xing via cutting) |
---|
| 1008 | |
---|
| 1009 | 9. Added ftp protocol plugin. (John Xing via cutting) |
---|
| 1010 | |
---|
| 1011 | 10. Added pdf and msword parser plugins. (John Xing via cutting) |
---|
| 1012 | |
---|
| 1013 | 11. Added pluggable indexing interface. By default, url, content, |
---|
| 1014 | anchors and title are indexed, as before, but now one can easily |
---|
| 1015 | alter this to, e.g., index metadata. A demonstration is provided |
---|
| 1016 | which extracts and indexes Creative Commons license urls. (cutting) |
---|
| 1017 | |
---|
| 1018 | 12. Add language identification plugin. |
---|
| 1019 | |
---|
| 1020 | The process of identification is as follows: |
---|
| 1021 | |
---|
| 1022 | 1. html (html only, HTML 4.0 "lang" attribute) |
---|
| 1023 | 2. meta tags (html only, http-equiv, dc.language) |
---|
| 1024 | 3. http header (Content-Language) |
---|
| 1025 | 4. if all of the above fail, "statistical analysis" |
---|
| 1026 | |
---|
| 1027 | 1 & 2 are run during the fetching phase and 3 & 4 are run during the |
---|
| 1028 | indexing phase. |
---|
| 1029 | |
---|
| 1030 | Currently supported languages (in "statistical analysis") are |
---|
| 1031 | da,de,el,en,es,fi,fr,it,nl,sv and pt. The corpus used was grabbed |
---|
| 1032 | from http://www.isi.edu/~koehn/europarl/ and the profiles were |
---|
| 1033 | built with the tool supplied in the patch. |
---|
| 1034 | |
---|
| 1035 | After indexing, the language can be found in the field named "lang". |
---|
| 1036 | |
---|
| 1037 | It's not 100% accurate but it's a start. |
---|
| 1038 | (Sami Siren) |
---|
| 1039 | |
---|
| 1040 | 13. Added SegmentMergeTool and "mergesegs" command, to remove |
---|
| 1041 | duplicated or otherwise unused content from several segments and |
---|
| 1042 | join them together into a single new segment. The tool also |
---|
| 1043 | optionally performs several other steps required for proper |
---|
| 1044 | operation of Nutch - such as indexing segments, deleting |
---|
| 1045 | duplicates, merging indices, and indexing the new single segment. |
---|
| 1046 | (Andrzej Bialecki) |
---|
| 1047 | |
---|
| 1048 | 14. Add the ability to retrieve ParseData of a search hit. ParseData |
---|
| 1049 | contains many valuable properties of a search hit. |
---|
| 1050 | |
---|
| 1051 | This is required (among others) to properly display the cached |
---|
| 1052 | content because it's not possible to determine the character |
---|
| 1053 | encoding from the output of the getContent() method (which returns |
---|
| 1054 | byte[]). The symptoms are that for HTML pages using non-latin1 or |
---|
| 1055 | non-UTF8 encodings the cached preview will almost certainly look |
---|
| 1056 | broken. Using the attached patch it is possible to determine the |
---|
| 1057 | character encoding from the ParseData (for HTTP: Content-Type |
---|
| 1058 | metadata), and encode the content accordingly. (Andrzej Bialecki) |
---|
| 1059 | |
---|
| 1060 | 15. Add a pluggable query interface. By default, the content, anchor |
---|
| 1061 | and url fields are searched as before. A sample plugin indexes |
---|
| 1062 | the host name and adds a "site:" keyword to query parsing. |
---|
| 1063 | |
---|
| 1064 | 16. Added support for "lang:" in queries. For example, searching with |
---|
| 1065 | "lang:en" restricts results to pages which were identified to |
---|
| 1066 | be in English. |
---|
| 1067 | |
---|
| 1068 | 17. Automatically optimize field queries to use cached Lucene filters. |
---|
| 1069 | This makes, for example, searches restricted by languages or sites |
---|
| 1070 | that are very common much faster. |
---|
| 1071 | |
---|
| 1072 | 18. Improved charset handling in jsp pages. (jshin by cutting) |
---|
| 1073 | |
---|
| 1074 | 19. Permit topic filtering when injecting DMOZ pages. (jshin by cutting) |
---|
| 1075 | |
---|
| 1076 | 20. When parsing crawled pages, interpret charset specifications in |
---|
| 1077 | html meta tags. (jshin by cutting) |
---|
| 1078 | |
---|
| 1079 | 21. Added support for "cc:licensed" in queries, which searches for documents |
---|
| 1080 | released under Creative Commons licenses. Attributes of the |
---|
| 1081 | license may also be queried, with, e.g., "cc:by" for |
---|
| 1082 | attribution-required licenses, "cc:nc" for non-commercial |
---|
| 1083 | licenses, etc. |
---|
| 1084 | |
---|
| 1085 | 22. Relative paths named in plugin.folders are now searched for on the |
---|
| 1086 | classpath. This makes, e.g., deployment in a war file much simpler. |
---|
| 1087 | |
---|
| 1088 | 23. Modifications to Fetcher.java. |
---|
| 1089 | |
---|
| 1090 | 1. Make sure it works properly with regard to creation and initialization |
---|
| 1091 | of plugin instances. The problem was that multiple threads race to |
---|
| 1092 | startUp() or shutDown() plugin instances. It was solved by synchronizing |
---|
| 1093 | certain code in PluginRepository.java and Extension.java. |
---|
| 1094 | (Stefan Groschupf via John Xing) |
---|
| 1095 | |
---|
| 1096 | 2. Added code to explicitly shutDown() plugins. Otherwise FetcherThreads |
---|
| 1097 | may never return (quit) if there are still data or other structures |
---|
| 1098 | (e.g., persistent socket connections) associated with plugins. (John Xing) |
---|
| 1099 | |
---|
| 1100 | 3. Fixed one type of Fetcher "hang" problem by monitoring named |
---|
| 1101 | FetcherThreads. If all FetcherThreads are gone (finished), |
---|
| 1102 | Fetcher.java is considered done. The problem was: there could be |
---|
| 1103 | runaway threads started by external libs via FetcherThreads. |
---|
| 1104 | Those threads never return, thus keep Fetcher from exiting normally. |
---|
| 1105 | (John Xing) |
---|
| 1106 | |
---|
| 1107 | 24. Eliminate excessive hits from sites. This is done efficiently by |
---|
| 1108 | adding the site name to Hit instances, and, when needed, |
---|
| 1109 | re-querying with too-frequent sites prohibited in the query. |
---|
| 1110 | |
---|
| 1111 | |
---|
| 1112 | Release 0.4 |
---|
| 1113 | |
---|
| 1114 | 1. Http class refactored. (Kevin Smith via Tom Pierce) |
---|
| 1115 | |
---|
| 1116 | 2. Add Finnish translation. (Sampo Syreeni via Doug Cutting) |
---|
| 1117 | |
---|
| 1118 | 3. Added Japanese translation. (Yukio Andoh via Doug Cutting) |
---|
| 1119 | |
---|
| 1120 | 4. Updated Dutch translation. (Ype Kingma via Doug Cutting) |
---|
| 1121 | |
---|
| 1122 | 5. Initial version of Distributed DB code. (Mike Cafarella) |
---|
| 1123 | |
---|
| 1124 | 6. Make things more tolerant of crashed fetcher output files. |
---|
| 1125 | (Doug Cutting) |
---|
| 1126 | |
---|
| 1127 | 7. New skin for website. (Frank Henze via Doug Cutting) |
---|
| 1128 | |
---|
| 1129 | 8. Added Spanish translation. (Diego Basch via Doug Cutting) |
---|
| 1130 | |
---|
| 1131 | 9. Add FTP support to fetcher. (John Xing via Doug Cutting) |
---|
| 1132 | |
---|
| 1133 | 10. Added Thai translation. (Pichai Ongvasith via Doug Cutting) |
---|
| 1134 | |
---|
| 1135 | 11. Added Robots.txt & throttling support to Fetcher.java. (Mike |
---|
| 1136 | Cafarella) |
---|
| 1137 | |
---|
| 1138 | 12. Added nightly build. (Doug Cutting) |
---|
| 1139 | |
---|
| 1140 | 13. Default all link scores to 1.0. (Doug Cutting) |
---|
| 1141 | |
---|
| 1142 | 14. Permit one to keep internal links. (Doug Cutting) |
---|
| 1143 | |
---|
| 1144 | 15. Fixed dedup to select shortest URL. (Doug Cutting) |
---|
| 1145 | |
---|
| 1146 | 16. Changed index merger so that merged index is written to named |
---|
| 1147 | directory, rather than to a generated name in that directory. |
---|
| 1148 | (Doug Cutting) |
---|
| 1149 | |
---|
| 1150 | 17. Disable coordination weighting of query clauses and other minor |
---|
| 1151 | scoring improvements. (Doug Cutting) |
---|
| 1152 | |
---|
| 1153 | 18. Added a new command, crawl, that constructs a database, injects a |
---|
| 1154 | url file and performs a few rounds of generate/fetch/updatedb. |
---|
| 1155 | This simplifies use for intranet sites. Changed some defaults to |
---|
| 1156 | be more intranet friendly. (Doug Cutting) |
---|
| 1157 | |
---|
| 1158 | 19. Fixed a bug where Fetcher.java didn't construct correct relative |
---|
| 1159 | links when a page was redirected. (Doug Cutting) |
---|
| 1160 | |
---|
| 1161 | 20. Fixed a query parser problem with lookahead over plusses and minuses. |
---|
| 1162 | (Doug Cutting) |
---|
| 1163 | |
---|
| 1164 | 21. Add support for HTTP proxy servers. (Sami Siren via Doug Cutting) |
---|
| 1165 | |
---|
| 1166 | 22. Permit searching while fetching and/or indexing. |
---|
| 1167 | (Sami Siren via Doug Cutting) |
---|
| 1168 | |
---|
| 1169 | 23. Fix a bug when throttling is disabled. (Sami Siren via Doug Cutting) |
---|
| 1170 | |
---|
| 1171 | 24. Updated Bahasa Malaysia translation. (Michael Lim via Doug Cutting) |
---|
| 1172 | |
---|
| 1173 | 25. Added Catalan translation. (Xavier Guardiola via Doug Cutting) |
---|
| 1174 | |
---|
| 1175 | 26. Added Brazilian Portuguese translation. |
---|
| 1176 | (A. Moreir via Doug Cutting) |
---|
| 1177 | |
---|
| 1178 | 27. Added a French translation. (Julien Nioche via Doug Cutting) |
---|
| 1179 | |
---|
| 1180 | 28. Updated to Lucene 1.4RC3. (Doug Cutting) |
---|
| 1181 | |
---|
| 1182 | 29. Add capability to boost by link count & use it in crawl tool. |
---|
| 1183 | (Doug Cutting) |
---|
| 1184 | |
---|
| 1185 | 30. Added plugin system. (Stefan Groschupf via Doug Cutting) |
---|
| 1186 | |
---|
| 1187 | 31. Add this change log file, for recording significant changes to |
---|
| 1188 | Nutch. Populate it with changes from the last few months. |
---|