Changes between Initial Version and Version 1 of crawl-urlfilter.txt


Timestamp:
Apr 22, 2008, 5:08:21 PM
Author:
waue

  • crawl-urlfilter.txt

    {{{
    # The url filter file used by the crawl command.

    # Better for intranet crawling.
    # Be sure to change MY.DOMAIN.NAME to your domain name.

    # Each non-comment, non-blank line contains a regular expression
    # prefixed by '+' or '-'.  The first matching pattern in the file
    # determines whether a URL is included or ignored.  If no pattern
    # matches, the URL is ignored.

    # skip file:, ftp:, & mailto: urls
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|pdf|PDF)$

    # skip URLs containing certain characters as probable queries, etc.
    -[*!@]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/.+?)/.*?\1/.*?\1/

    # accept hosts in MY.DOMAIN.NAME
    # (as written, the trailing .* accepts any http:// host; put your domain in its place to restrict the crawl)
    +^http://([a-z0-9]*\.)*.*/

    # skip everything else
    -.
    }}}
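
The first-match-wins rule described in the file's comments can be tried out with a small Python sketch. It is not the crawler's own code. The names `parse_rules` and `accepts` and the example.com URLs are hypothetical, and it assumes each pattern may match anywhere in the URL (an unanchored search) unless the pattern anchors itself with `^` or `$`.

```python
import re

# Rules copied from the listing above, in file order.
RULES_TEXT = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|pdf|PDF)$
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts (any http:// host, as written in the listing)
+^http://([a-z0-9]*\.)*.*/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn filter text into a list of (include, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return the verdict of the first matching rule; ignore the URL if none match."""
    for include, regex in rules:
        if regex.search(url):  # assumption: unanchored match
            return include
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in (
        "http://example.com/index.html",
        "http://example.com/logo.PNG",
        "https://example.com/",
        "ftp://example.com/file",
    ):
        print(url, "->", "included" if accepts(url, rules) else "ignored")
```

Because the only `+` rule begins with `^http://`, URLs on other schemes such as `https://` fall through to the final `-.` rule and are ignored.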