Apache Nutch 2.3.1 Fetcher 给出无效的 uri 异常

Apache Nutch 2.3.1 Fetcher giving Invalid uri exception

我已经使用 Hadoop 生态系统配置了 Apache Nutch 2.3.1。我必须获取一些阿拉伯语脚本网站。 Nutch 在提取时为少数 URL 提供例外。以下是一个示例异常

java.lang.IllegalArgumentException: Invalid uri 'http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html': escaped absolute path not valid
    at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
    at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
    at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:77)
    at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:173)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:245)
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:564)

我什至在 1.x 分支上也能重现这个问题。问题是 Apache HTTP 客户端库内部使用的 Java URI class 不支持非转义 UTF-8 字符:

来自 Javajava.net.URI 的文档:

Character categories

RFC 2396 specifies precisely which characters are permitted in the various components of a URI reference. The following categories, most of which are taken from that specification, are used below to describe these constraints:

  • alpha The US-ASCII alphabetic characters, 'A' through 'Z' and 'a' through 'z'
  • digit The US-ASCII decimal digit characters, '0' through '9'
  • alphanum All alpha and digit characters unreserved All alphanum characters together with those in the string "_-!.~'()*"
  • punct The characters in the string ",;:$&+="
  • reserved All punct characters together with those in the string "?/[]@"
  • escaped Escaped octets, that is, triplets consisting of the percent character ('%') followed by two hexadecimal digits ('0'-'9', 'A'-'F', and 'a'-'f')
  • other The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the Character.isISOControl method), and are not space characters (according to the Character.isSpaceChar method) (Deviation from RFC 2396, which is limited to US-ASCII)

The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters.

正确转义 URL 看起来更像:

http://agahi.safirak.com/ads/850/%D9%BE%DB%8C%DA%86-%D8%A8%D9%86%D8%AF-%D8%A8%D8%A7%D8%AF%DB%8C-%D9%87%D9%81%D8%AA%DB%8C%D8%B1%DB%8C-1800-%D8%AF%D9%88%D8%B1-%D8%A8%D8%A7%D8%AF%DB%8C-%D8%AC%DB%8C%D8%B3%D9%88%D9%86.html

实际上,如果您在 Chrome 上打开示例 URL,然后从地址栏复制 URL,您将获得转义表示。请随意为此打开一个问题(否则我会这样做)。同时,您可以尝试使用不使用 Apache HTTP 客户端的 protocol-http 插件。我在本地测试过,parsechecker 工作正常:

➜  local (master) ✗ bin/nutch parsechecker "http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html"
fetching: http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
robots.txt whitelist not configured.
parsing: http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
contentType: text/html
signature: 048b390ab07464f5d61ae09646253529
---------
Url
---------------

http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: پیچ بند بادی هفتیری 1800 دور بادی جیسون-نیازمندی سفیرک
Outlinks: 76
outlink: toUrl: http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html anchor: 
outlink: toUrl: http://agahi.safirak.com/assets/fonts/font-awesome/css/font-awesome.min.css anchor: 
outlink: toUrl: http://agahi.safirak.com/assets/css/bootstrap.css anchor:
...