Python3: HTTP Error 302 while using urllib

Question

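I want to read the prices of different stocks from a website. So I wrote this small script, which reads the page source and then parses out the value: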

stock_reader.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from re import search
from urllib import request


def main():
    links = [
        [
            'CSG',
            'UBS',
        ],
        [
            'http://www.tradegate.de/orderbuch.php?isin=CH0012138530',
            'http://www.tradegate.de/orderbuch.php?isin=CH0244767585',
        ],
    ]

    for i in range(len(links[0])):
        url = links[1][i]
        htmltext = request.urlopen(url).read().decode('utf-8')
        source = htmltext.splitlines()
        for line in source:
            if 'id="bid"' in line:
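                # Raw string avoids an invalid-escape warning; '.' matches any
                # single character, so both '.' and ',' decimal separators work
                m = search(r'\d+.\d+', line)
                if m:
                    print(m.group())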


if __name__ == '__main__':
    main()

Sometimes it works, but sometimes it raises this error:

Error message

Traceback (most recent call last):
  File "./aktien_reader.py", line 39, in <module>
    main()
  File "./aktien_reader.py", line 30, in main
    htmltext = request.urlopen(url).read().decode('utf-8')
  File "/usr/lib/python3.3/urllib/request.py", line 160, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.3/urllib/request.py", line 479, in open
    response = meth(req, response)
  File "/usr/lib/python3.3/urllib/request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.3/urllib/request.py", line 511, in error
    result = self._call_chain(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 451, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 696, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python3.3/urllib/request.py", line 479, in open
    response = meth(req, response)
  File "/usr/lib/python3.3/urllib/request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.3/urllib/request.py", line 511, in error
    result = self._call_chain(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 451, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 696, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python3.3/urllib/request.py", line 479, in open
    response = meth(req, response)
  File "/usr/lib/python3.3/urllib/request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.3/urllib/request.py", line 511, in error
    result = self._call_chain(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 451, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 696, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python3.3/urllib/request.py", line 479, in open
    response = meth(req, response)
  File "/usr/lib/python3.3/urllib/request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.3/urllib/request.py", line 511, in error
    result = self._call_chain(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 451, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 696, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python3.3/urllib/request.py", line 479, in open
    response = meth(req, response)
  File "/usr/lib/python3.3/urllib/request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.3/urllib/request.py", line 511, in error
    result = self._call_chain(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 451, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 686, in http_error_302
    self.inf_msg + msg, headers, fp)
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found

My question is: why does this happen, and how can I avoid it?

Answer 1

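HTTP status code 302 is a kind of redirect: the response carries a Location header with a new URL to request instead (not necessarily a working URL), for example:

Location: http://www.example.com/x/y/

This is often used to block bots that make too many requests in too short a time frame. So it is not a coding problem.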
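
To see where the server is actually sending you, one option is to stop urllib from following the redirect and read the Location header yourself. The sketch below is only an illustration: NoRedirect and inspect_redirect are hypothetical helper names, not part of urllib.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Hypothetical helper: a redirect handler that refuses to follow redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to build a follow-up request,
        # so the 30x response ends up raised as an HTTPError instead.
        return None


def inspect_redirect(url, timeout=15):
    """Return (status code, Location header) of the first response only."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status, None  # no redirect happened
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get('Location')
```

Calling inspect_redirect on one of the URLs from the question should show whether the 302 points back at the same page, which is the pattern that ends in the "infinite loop" message.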

Answer 2

This probably happens because the target site uses cookies and redirects you when you do not send them back.

You can use something like this:

import urllib.request
from http.cookiejar import CookieJar

url = "http://www.tradegate.de/orderbuch.php?isin=CH0012138530"

# Note: urllib does not decompress gzip/deflate bodies automatically; drop Accept-Encoding if you need the plain text.
req = urllib.request.Request(url, None, {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3','Accept-Encoding': 'gzip, deflate, sdch','Accept-Language': 'en-US,en;q=0.8','Connection': 'keep-alive'})

cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
response = opener.open(req)
response.read()

This way you support cookies, and the website will let you fetch the page :-)
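
To illustrate the cookie theory without touching the real site, here is a minimal sketch using a throwaway local server. It assumes the site behaves like the hypothetical CookieGate handler below, which redirects to the same URL and sets a cookie until that cookie is sent back. Whether tradegate.de really works this way is a guess. The names CookieGate and demo, the cookie name and the page body are all made up for the example.

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieGate(BaseHTTPRequestHandler):
    """Hypothetical server: redirects to the same URL until a cookie comes back."""

    def do_GET(self):
        if 'session=1' in self.headers.get('Cookie', ''):
            body = b'<td id="bid">12,34</td>'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(302)
            self.send_header('Set-Cookie', 'session=1')
            self.send_header('Location', self.path)
            self.send_header('Content-Length', '0')
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def demo():
    """Return (result without cookies, HTTP status with cookies)."""
    server = HTTPServer(('127.0.0.1', 0), CookieGate)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = 'http://127.0.0.1:{}/quote'.format(server.server_port)
    # An empty ProxyHandler ignores any proxy configured in the environment
    no_proxy = urllib.request.ProxyHandler({})
    try:
        plain = urllib.request.build_opener(no_proxy)
        try:
            plain.open(url, timeout=5)
            without_cookies = 'ok'
        except urllib.error.HTTPError as err:
            without_cookies = err.code  # redirect loop detected by urllib
        with_jar = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(CookieJar()))
        with with_jar.open(url, timeout=5) as response:
            with_cookies = response.status
    finally:
        server.shutdown()
        server.server_close()
    return without_cookies, with_cookies
```

Under these assumptions demo() returns (302, 200). The plain opener is bounced back to the same URL until urllib gives up. The opener with a cookie jar sends the cookie on the first redirect and gets the page.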

Another way is to use the requests package, which is the simplest and easiest to use. In your case it would look like this:

import requests

url = "http://www.tradegate.de/orderbuch.php?isin=CH0012138530"
r = requests.get(url, headers={'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}, timeout=15)
print(r.content)

Answer 3

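This answer is a simplification of the one by Cédric J. You don't really need to import CookieJar or set various Accept headers if you don't want to. You should, however, generally set a timeout. It is tested with Python 3.7. I usually remember to use a new opener for each random URL that I want cookies for.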

from urllib.request import build_opener, HTTPCookieProcessor, Request
url = 'https://www.cell.com/cell-metabolism/fulltext/S1550-4131(18)30630-2'
opener = build_opener(HTTPCookieProcessor())

Without a Request object:

response = opener.open(url, timeout=30)
content = response.read()

With a Request object:

request = Request(url)
response = opener.open(request, timeout=30)
content = response.read()
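
Applied to the script from the question, the cookie-enabled opener could be wired in roughly like this. Treat it as a sketch rather than a tested fix for tradegate.de. extract_bid and fetch_bid are made-up helper names, the id="bid" marker is taken from the question, and the regex assumes the price uses '.' or ',' as the decimal separator.

```python
import re
from urllib.request import HTTPCookieProcessor, build_opener

# Assumption: the bid is a number with '.' or ',' as decimal separator
BID_PATTERN = re.compile(r'\d+[.,]\d+')


def extract_bid(html):
    """Return the first number on the line containing id="bid", or None."""
    for line in html.splitlines():
        if 'id="bid"' in line:
            match = BID_PATTERN.search(line)
            if match:
                return match.group()
    return None


def fetch_bid(url, timeout=30):
    """Fetch the page with a fresh cookie-aware opener and parse the bid."""
    opener = build_opener(HTTPCookieProcessor())
    try:
        with opener.open(url, timeout=timeout) as response:
            html = response.read().decode('utf-8', errors='replace')
    except OSError as err:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        print('Could not fetch {}: {}'.format(url, err))
        return None
    return extract_bid(html)
```

In main() from the question, the urlopen call and the inner loop would then collapse into print(fetch_bid(url)). A failed request prints a message and yields None instead of a traceback.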
