nutch 1.16 parsechecker issue with file:/directory/ inputs

Following up on nutch 1.16 skips file:/directory styled links in file system crawl, I have been trying (and failing) to get Nutch to crawl through different directories and subdirectories on a Windows 10 installation, calling the commands from Cygwin. The file dirs/seed.txt used to start the crawl contains the following:

file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/

Running cat ./dirs/seed.txt | ./bin/nutch normalizerchecker -stdin to check how Nutch normalizes these (with the default regex-normalize.xml) yields

 file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
 file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
 file:/localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/
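
(Side note: the third line appears to be the NUTCH-1483 rule from regex-normalize.xml, quoted further down, at work: every "file://+" prefix is collapsed to "file:/", so "localhost" is not treated as a host name but simply becomes the first path segment. A minimal stand-alone Java sketch of just that one substitution, reusing the seed URLs above, reproduces the output:)

import java.util.Arrays;
import java.util.List;

public class NormalizeSketch {
    public static void main(String[] args) {
        // The three seed URLs from dirs/seed.txt.
        List<String> seeds = Arrays.asList(
                "file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/",
                "file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/",
                "file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/");
        for (String url : seeds) {
            // NUTCH-1483 rule from regex-normalize.xml: keep a single slash
            // after "file:". Note that "localhost" survives as a path segment.
            System.out.println(url.replaceAll("^file://+", "file:/"));
        }
    }
}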

while running cat ./dirs/seed.txt | ./bin/nutch filterchecker -stdin returns:

+file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
+file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
+file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/

meaning that all directories are considered valid. So far, so good. However, running the following:

cat ./dirs/seed.txt | ./bin/nutch parsechecker -stdin

yields the same error for all three directories, namely:

Fetch failed with protocol status: notfound(14), lastModified=0

The log files don't really tell me what went wrong either, just that the input is not read no matter what, since the logs contain nothing but a single "fetching directory X" message per entry.

So what exactly is going on here? For completeness' sake, I will also leave the nutch-site.xml, regex-urlfilter.txt and regex-normalize.xml files here.

nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
 <name>http.agent.name</name>
 <value>NutchSpiderTest</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>NutchSpiderTest,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>I am just testing nutch, please tell me if it's bothering your website</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  By default Nutch includes plugins to crawl HTML and various other
  document formats via HTTP/HTTPS and indexing the crawled content
  into Solr.  More plugins are available to support more indexing
  backends, to fetch ftp:// and file:// URLs, for focused crawling,
  and many other use cases.
  </description>
</property>

<property>
 <name>file.content.limit</name>
 <value>-1</value>
 <description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>

<property>
  <name>file.crawl.parent</name>
  <value>false</value>
  <description>The crawler is not restricted to the directories that you specified in the
    Urls file but it is jumping into the parent directories as well. For your own crawlings you can
    change this behavior (set to false) the way that only directories beneath the directories that you specify get
    crawled.</description>
</property>


<property>
    <name>parser.skip.truncated</name>
    <value>false</value>
    <description>Boolean value for whether we should skip parsing for truncated documents. By default this
        property is activated due to extremely high levels of CPU which parsing can sometimes take.
    </description>
</property>
<!-- the following is just an attempt at using a solution I found elsewhere, didn't work -->
<property>
  <name>http.robot.rules.whitelist</name>
  <value>file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/</value>
  <description>Comma separated list of hostnames or IP addresses to ignore 
  robot rules parsing for. Use with care and only if you are explicitly
  allowed by the site owner to ignore the site's robots.txt!
  </description>
</property>

</configuration>

regex-urlfilter.txt:

# The default url filter.
# Better for whole-internet crawling.
# Please comment/uncomment rules to your needs.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip http: ftp: mailto: and https: urls
-^(http|ftp|mailto|https):

# This change is not necessary but may make your life easier.  
# Any file types you do not want to index need to be added to the list otherwise 
# Nutch will often try to parse them and fail in doing so as it doesnt know 
# how to deal with a lot of binary file types.:
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS
#|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov
#|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY
#|cs|CS|dll|DLL|refresh|REFRESH)$

# skip URLs longer than 2048 characters, see also db.max.outlink.length
#-^.{2049,}

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
+.*(/[^/]+)/[^/]+/[^/]+/

# For safe web crawling if crawled content is exposed in a public search interface:
# - exclude private network addresses to avoid that information
#   can be leaked by placing links pointing to web interfaces of services
#   running on the crawling machines (e.g., HDFS, Hadoop YARN)
# - in addition, file:// URLs should be either excluded by a URL filter rule
#   or ignored by not enabling protocol-file
#
# - exclude localhost and loop-back addresses
#     http://localhost:8080
#     http://127.0.0.1/ .. http://127.255.255.255/
#     http://[::1]/
#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
#
# - exclude private IP address spaces
#     10.0.0.0/8
#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
#     192.168.0.0/16
#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
#     172.16.0.0/12
#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)

# accept anything else
+.

regex-normalize.xml:

<?xml version="1.0"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<!-- This is the configuration file for the RegexUrlNormalize Class.
     This is intended so that users can specify substitutions to be
     done on URLs using the Java regex syntax, see
     https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
     The rules are applied to URLs in the order they occur in this file.  -->

<!-- WATCH OUT: an xml parser reads this file an ampersands must be
     expanded to &amp; -->

<!-- The following rules show how to strip out session IDs, default pages, 
     interpage anchors, etc. Order does matter!  -->
<regex-normalize>

<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
  <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution></substitution>
</regex>

<!-- changes default pages into standard for /index.html, etc. into /
<regex>
  <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
  <substitution>/</substitution>
</regex> -->

<!-- removes interpage href anchors such as site.com#location -->
<regex>
  <pattern>#.*?(\?|&amp;|$)</pattern>
  <substitution></substitution>
</regex>

<!-- cleans ?&amp;var=value into ?var=value -->
<regex>
  <pattern>\?&amp;</pattern>
  <substitution>\?</substitution>
</regex>

<!-- cleans multiple sequential ampersands into a single ampersand -->
<regex>
  <pattern>&amp;{2,}</pattern>
  <substitution>&amp;</substitution>
</regex>

<!-- removes trailing ? -->
<regex>
  <pattern>[\?&amp;\.]$</pattern>
  <substitution></substitution>
</regex>

<!-- normalize file:/// protocol prefix: -->
<!--  keep one single slash (NUTCH-1483) -->
<regex>
  <pattern>^file://+</pattern>
  <substitution>file:/</substitution>
</regex>

<!-- removes duplicate slashes but -->
<!-- * allow 2 slashes after colon ':' (indicating protocol) -->

<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>

</regex-normalize>

Any idea what I'm doing wrong?

Nutch's file: protocol implementation "fetches" local files by creating a File object using the path component of the URL: /cygdrive/c/Users/abc/Desktop/anotherdirectory/. As stated in the discussion "Is there a java sdk for cygwin?", Java does not translate Cygwin paths, but replacing cygdrive/c/ with c:/ should work.
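
To illustrate why the fetch fails with notfound(14), here is a minimal stand-alone sketch (not Nutch's actual code; the class name and paths are only for illustration): a Windows-native JVM knows nothing about Cygwin's virtual /cygdrive mount, so a java.io.File built from the Cygwin-style path reports that the directory does not exist, while a plain Windows path to the same directory works.

import java.io.File;

public class CygwinPathCheck {
    public static void main(String[] args) {
        // Cygwin-style path as it appears in the seed URL; /cygdrive is a
        // virtual mount that only exists inside Cygwin processes.
        File cygwinStyle = new File("/cygdrive/c/Users/abc/Desktop/anotherdirectory/");
        // Plain Windows path to (presumably) the same directory.
        File windowsStyle = new File("c:/Users/abc/Desktop/anotherdirectory/");

        System.out.println(cygwinStyle.exists());  // false -> Nutch reports notfound(14)
        System.out.println(windowsStyle.exists()); // true, provided the directory exists
    }
}

Accordingly, a seed line such as file:/c:/Users/abc/Desktop/anotherdirectory/ (cygdrive/c/ replaced by c:/) should let parsechecker fetch the directory listing.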