urllib.request.urlopen cannot fetch the primaries page of Stack Overflow elections
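I have a small script that summarizes and sorts the candidate scores in the primaries of Stack Exchange elections. It works for most sites, but not for Stack Overflow, where retrieving the URL with urllib's request.urlopen fails with a 403 (Forbidden) error. To demonstrate the problem: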
    from urllib import request

    urls = (
        'http://math.stackexchange.com/election/5?tab=primary',
        'http://serverfault.com/election/5?tab=primary',
        'http://stackoverflow.com/election/7?tab=primary',
    )

    for url in urls:
        print('fetching {} ...'.format(url))
        request.urlopen(url).read()
The output: the URLs for Math SE and Server Fault work fine, but the one for Stack Overflow fails:
    fetching http://math.stackexchange.com/election/5?tab=primary ...
    fetching http://serverfault.com/election/5?tab=primary ...
    fetching http://stackoverflow.com/election/7?tab=primary ...
    Traceback (most recent call last):
      File "examples/t.py", line 11, in <module>
        request.urlopen(url).read()
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 161, in urlopen
        return opener.open(url, data, timeout)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 469, in open
        response = meth(req, response)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 579, in http_response
        'http', request, response, code, msg, hdrs)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 507, in error
        return self._call_chain(*args)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 441, in _call_chain
        result = func(*args)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/urllib/request.py", line 587, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 403: Forbidden
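With curl, all of the URLs work, so the problem seems to be specific to urllib's request.urlopen. I tried this on both OS X and Linux, with the same result. What is going on? How can this be explained?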
Use requests instead of urllib:
    import requests

    urls = (
        'http://math.stackexchange.com/election/5?tab=primary',
        'http://serverfault.com/election/5?tab=primary',
        'http://stackoverflow.com/election/7?tab=primary',
    )

    for url in urls:
        print('fetching {} ...'.format(url))
        data = requests.get(url)
If you want to be slightly more efficient, reuse a single HTTP session:
    import requests

    urls = (
        'http://math.stackexchange.com/election/5?tab=primary',
        'http://serverfault.com/election/5?tab=primary',
        'http://stackoverflow.com/election/7?tab=primary',
    )

    with requests.Session() as session:
        for url in urls:
            print('fetching {} ...'.format(url))
            data = session.get(url)
The cause seems to be the User-Agent header that urllib sends. This code works for me:
    from urllib import error, request

    urls = (
        'http://math.stackexchange.com/election/5?tab=primary',
        'http://serverfault.com/election/5?tab=primary',
        'http://stackoverflow.com/election/7?tab=primary',
    )

    for url in urls:
        print('fetching {} ...'.format(url))
        try:
            request.urlopen(url).read()
        except error.HTTPError:
            print('got an exception, changing user-agent to urllib default')
            req = request.Request(url)
            # Explicitly send the same User-Agent that urllib uses by default
            req.add_header('User-Agent', 'Python-urllib/3.4')
            try:
                request.urlopen(req)
            except error.HTTPError:
                print('got another exception, changing user-agent to something else')
                # Retry with a User-Agent that differs from the urllib default
                req.add_header('User-Agent', 'not-Python-urllib/3.4')
                request.urlopen(req)
        print('success with url: {}'.format(url))
Here is the current output (2015-11-16), with blank lines added for readability:
    fetching http://math.stackexchange.com/election/5?tab=primary ...
    success with url: http://math.stackexchange.com/election/5?tab=primary

    fetching http://serverfault.com/election/5?tab=primary ...
    success with url: http://serverfault.com/election/5?tab=primary

    fetching http://stackoverflow.com/election/7?tab=primary ...
    got an exception, changing user-agent to urllib default
    got another exception, changing user-agent to something else
    success with url: http://stackoverflow.com/election/7?tab=primary
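
If the User-Agent really is what triggers the 403, the retry logic can probably be skipped by sending a non-default User-Agent on the first request. The following is only a minimal sketch under that assumption: the names `USER_AGENT`, `default_user_agent`, `build_request` and `fetch` are made up for illustration, the agent string is arbitrary, and whether the server still accepts it today is not something this answer can promise.

```python
from urllib import request

# Hypothetical identifier for the script; any string that differs from
# the urllib default is an assumption, not a value the site documents.
USER_AGENT = 'election-primary-summary/0.1'


def default_user_agent():
    """Return the User-Agent that urllib's default opener would send."""
    opener = request.build_opener()
    # addheaders is a list of (name, value) pairs; urllib stores the
    # header name as 'User-agent'.
    return dict(opener.addheaders)['User-agent']


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries an explicit User-Agent."""
    return request.Request(url, headers={'User-Agent': user_agent})


def fetch(url, timeout=10):
    """Fetch the body of url as bytes, sending the custom User-Agent."""
    with request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Calling `default_user_agent()` shows what the failing requests identified themselves as (a string starting with `Python-urllib/`), and replacing `request.urlopen(url).read()` in the original loop with `fetch(url)` would send the custom header for every site.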