urlopen Returning Redirect Error for Valid Links

Question

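I'm building a broken-link checker in Python, and it's becoming a chore to build the logic that correctly identifies links that fail to resolve when visited in a browser. I've found a set of links where I can consistently reproduce a redirect error with my scraper, but which resolve perfectly when visited in a browser. I was hoping I could find some insight here.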

import urllib
import urllib.request
import html.parser
import requests
from requests.exceptions import HTTPError
from socket import error as SocketError

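# The URL from this question that reliably reproduces the error.
url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'

try: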
    req=urllib.request.Request(url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3','Accept-Encoding': 'gzip, deflate, sdch','Accept-Language': 'en-US,en;q=0.8','Connection': 'keep-alive'})
    response = urllib.request.urlopen(req)
    raw_response = response.read().decode('utf8', errors='ignore')
    response.close()
except urllib.request.HTTPError as inst:
    output = format(inst)
    # Print inside the handler: output is only defined when an error occurred.
    print(output)

An example of a URL that reliably returns this error is "http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html". It resolves perfectly when visited in a browser, but the code above returns the following error:

HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently

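Any ideas on how I can correctly identify these links as functional, without blindly ignoring links from that site (which could miss links that are genuinely broken)?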

Answer 1

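You get the infinite loop error because the page you want to scrape uses cookies and redirects when the client does not send the cookie. You will get the same error with most other scrapers, and with browsers, when cookies are disabled.

You need an http.cookiejar.CookieJar and a urllib.request.HTTPCookieProcessor to avoid the redirect loop: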

import urllib
import urllib.request
import html.parser
import requests
from requests.exceptions import HTTPError
from socket import error as SocketError
from http.cookiejar import CookieJar

# url is the link being checked, as in the question.
try:
    req=urllib.request.Request(url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3','Accept-Encoding': 'gzip, deflate, sdch','Accept-Language': 'en-US,en;q=0.8','Connection': 'keep-alive'})
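    # Store cookies set by the server and send them back on each redirect,
    # which is what breaks the redirect loop.
    cj = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    response = opener.open(req)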
    raw_response = response.read().decode('utf8', errors='ignore')
    response.close()
except urllib.request.HTTPError as inst:
    output = format(inst)
    print(output)
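
To see the cookie effect in isolation, here is a minimal sketch that reproduces it against a throwaway local server instead of the real site. The CookieGateHandler class, the check function and the seen=1 cookie are all made up for illustration; the server is only assumed to behave like the forum page, redirecting to itself until the client sends a cookie back.

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieGateHandler(BaseHTTPRequestHandler):
    """Hypothetical server: redirects to itself until the client sends a cookie."""

    def do_GET(self):
        if 'seen=1' in self.headers.get('Cookie', ''):
            body = b'ok'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # No cookie yet: set one and redirect back to the same path.
            self.send_response(301)
            self.send_header('Set-Cookie', 'seen=1')
            self.send_header('Location', self.path)
            self.send_header('Content-Length', '0')
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def check(url, use_cookies):
    """Return the HTTP status, or the error text if the request fails."""
    # An empty ProxyHandler keeps the local demo from going through a proxy.
    handlers = [urllib.request.ProxyHandler({})]
    if use_cookies:
        handlers.append(urllib.request.HTTPCookieProcessor(CookieJar()))
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return str(err)


# Port 0 lets the OS pick a free port for the throwaway server.
server = HTTPServer(('127.0.0.1', 0), CookieGateHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/page' % server.server_port

without_cookies = check(url, use_cookies=False)
with_cookies = check(url, use_cookies=True)
server.shutdown()
server.server_close()

print(without_cookies)
print(with_cookies)
```

Without the cookie processor the request should end in the same "infinite loop" HTTP Error 301 shown in the question, while the opener with a CookieJar should get through to the 200 response. In a link checker, that suggests building one cookie-aware opener and reusing it, rather than calling urlopen directly.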

Answer 2

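I agree with the comments in the first answer, but it did not work for me (I got some encoded/compressed byte data, nothing readable).

The link mentioned uses urllib2. It also works with urllib in Python 3.7, as follows: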

from urllib.request import build_opener, HTTPCookieProcessor
opener = build_opener(HTTPCookieProcessor())
response = opener.open('http://www.bad.org.uk')
print(response.read())
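
A likely reason for the unreadable bytes is that the request headers in the question advertise Accept-Encoding: gzip, deflate, sdch, and urllib does not decompress response bodies on its own. The simplest fix is probably to drop that header. If you want to keep it, a rough sketch of decoding the body by hand could look like this; decode_body is a hypothetical helper, not part of urllib, and the compressed bytes are simulated so the sketch runs without network access.

```python
import gzip
import zlib


def decode_body(raw, content_encoding, charset='utf-8'):
    """Decompress a response body according to its Content-Encoding header."""
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        raw = gzip.decompress(raw)
    elif encoding == 'deflate':
        try:
            raw = zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    # Anything else (including no header) is treated as uncompressed.
    return raw.decode(charset, errors='ignore')


# Simulated gzip-compressed server response.
compressed = gzip.compress('<html>hello</html>'.encode('utf-8'))
text = decode_body(compressed, 'gzip')
print(text)
```

With a real response object this would be called as decode_body(response.read(), response.headers.get('Content-Encoding')).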

Answer 3

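I tried the solutions above without success.

This problem seems to occur when the URL you are trying to open is malformed (or is not what the REST service expects). For example, my problem turned out to be that I was requesting https://host.com/users/4484486 while the host expected a trailing slash: requesting https://host.com/users/4484486/ solved the problem.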
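
For a link checker, one way to act on this is to retry a failing URL once with a trailing slash before reporting it as broken. Here is a small sketch of the URL rewrite only; with_trailing_slash is a hypothetical helper name, and whether a retry is appropriate depends on the site (it would be wrong for paths that name a file, such as a page ending in .html).

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url):
    """Return url with a '/' appended to its path if it lacks one."""
    parts = urlsplit(url)
    if parts.path.endswith('/'):
        return url
    # Only the path changes; query string and fragment are preserved.
    return urlunsplit(parts._replace(path=parts.path + '/'))


candidate = with_trailing_slash('https://host.com/users/4484486')
print(candidate)
```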
