source: nutchez-0.1/conf/regex-urlfilter.txt.bek @ 66

Last change on this file since 66 was 66, checked in by waue, 15 years ago

NutchEz - an easy way to nutch

  • Property svn:executable set to *
File size: 1.5 KB
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
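
The first-match rule described in the comments above can be illustrated with a small Python sketch. This is not Nutch's own code: the names `FILTER_TEXT`, `parse_rules`, and `accepts` are hypothetical, and the sketch assumes each pattern is tested with unanchored "find" semantics (`re.search`), so a pattern may match anywhere in the URL unless it uses `^` or `$`.

```python
import re

# Rule lines copied from the filter file above, in file order.
FILTER_TEXT = r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
"""


def parse_rules(text):
    """Turn filter text into an ordered list of (include, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return True if the first matching rule is a '+' rule, else False."""
    for include, regex in rules:
        # Assumption: unanchored search, not a full-string match.
        if regex.search(url):
            return include
    return False  # no pattern matched, so the URL is ignored


RULES = parse_rules(FILTER_TEXT)

for url in [
    "http://example.com/index.html",
    "ftp://example.com/file",
    "http://example.com/logo.png",
    "http://example.com/search?q=nutch",
    "http://example.com/a/b/a/c/a/",
]:
    print(url, "->", "include" if accepts(url, RULES) else "ignore")
```

Because the first match wins, rule order matters: moving `+.` to the top of the file would include every non-empty URL before any `-` rule was consulted.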