Nutch inconsistently ignores redirects
I'm having trouble crawling a very simple redirect case (Nutch 1.9 / OpenJDK 7).
Here is a packet capture of the process.
Time Source Destination Protocol Info
12.988003 99.99.99.99 8.8.4.4 DNS Standard query 0xc165 A bloomberg.com
13.032343 8.8.4.4 99.99.99.99 DNS Standard query response 0xc165 A 69.191.212.191 A 69.191.251.238
13.124471 99.99.99.99 69.191.212.191 HTTP GET /robots.txt HTTP/1.0
13.228846 69.191.212.191 99.99.99.99 HTTP HTTP/1.1 301 Moved Permanently (text/html)
13.264230 99.99.99.99 8.8.4.4 DNS Standard query 0x7089 A www.bloomberg.com
13.344767 8.8.4.4 99.99.99.99 DNS Standard query response 0x7089 CNAME www.bloomberg.com.edgekey.net CNAME e4569.x.akamaiedge.net A 23.214.189.136
13.351030 99.99.99.99 23.214.189.136 HTTP GET /robots.txt HTTP/1.0
13.359121 23.214.189.136 99.99.99.99 HTTP HTTP/1.0 200 OK (text/plain)
13.448604 99.99.99.99 69.191.212.191 HTTP GET / HTTP/1.0
13.537211 69.191.212.191 99.99.99.99 HTTP HTTP/1.1 301 Moved Permanently (text/html)
13.640146 99.99.99.99 69.191.212.191 HTTP GET / HTTP/1.0
13.738564 69.191.212.191 99.99.99.99 HTTP HTTP/1.1 301 Moved Permanently (text/html)
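Nutch tries to fetch http://bloomberg.com, which replies with a 301 redirect to http://www.bloomberg.com. The redirect for robots.txt is handled correctly. For 'GET /', however, the fetcher keeps requesting the original hostname, which keeps replying with a 301. The fetch fails no matter how large http.redirect.max is (I tested with 10).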
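To make the expected behavior concrete, here is a minimal Python sketch of a redirect-following loop. It is not Nutch's actual fetcher code: `follow_redirects`, `fake_fetch`, and the `max_redirects` parameter (loosely standing in for http.redirect.max) are hypothetical names used only for illustration. The point is that each 3xx response should move the next request to the Location target rather than repeat the original URL.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# (status code, Location header or None); a stand-in for a real HTTP client.
Response = Tuple[int, Optional[str]]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(
    fetch: Callable[[str], Response], url: str, max_redirects: int
) -> Tuple[int, List[str]]:
    """Follow up to max_redirects redirects; return final status and URLs visited."""
    current = url
    visited = [current]
    for _ in range(max_redirects + 1):
        status, location = fetch(current)
        if status not in REDIRECT_CODES or location is None:
            return status, visited
        # Key step: the next request goes to the redirect target,
        # not back to the URL that just answered with a 3xx.
        current = urljoin(current, location)
        visited.append(current)
    raise RuntimeError(f"more than {max_redirects} redirects starting at {url}")


def fake_fetch(url: str) -> Response:
    """Canned responses mirroring the packet capture (no network access)."""
    table = {
        "http://bloomberg.com/": (301, "http://www.bloomberg.com/"),
        "http://www.bloomberg.com/": (200, None),
    }
    return table[url]


if __name__ == "__main__":
    print(follow_redirects(fake_fetch, "http://bloomberg.com/", 10))
```

With a loop like this, one redirect is enough to reach the www host. The capture instead shows 'GET /' going to 69.191.212.191 again after the first 301, which is why raising the redirect limit does not help.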
Nutch 1.9 is running on:
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.12.04.1)
OpenJDK Client VM (build 24.65-b04, mixed mode, sharing)
Is this a bug (could you please confirm?) or a misconfiguration on my side?
Thanks.
This is a bug; 1.10 should ship with the fix:
https://github.com/apache/nutch/commit/ed052df8822380ccfa89a9ffa1df324933669a59