Why does urllib.request.urlopen sometimes not work, while browsers do?
I am trying to download some content using Python's urllib.request. The following command raises an exception:
import urllib.request
print(urllib.request.urlopen("https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/").code)
Result:
...
HTTPError: HTTP Error 403: Forbidden
If I use firefox or links (a command-line browser), I get the content and status code 200. If I use lynx, strangely, I also get a 403.
I would expect all methods to
- work the same way
- succeed
Why is that not the case?
Most likely the site is blocking people from scraping it. You can fool it to a basic degree by including header information, among other things. See here for more information.
Quoted from: https://docs.python.org/3/howto/urllib2.html#headers
import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
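Note that the docs example above sends a POST, because a `data` payload is supplied. For the URL in the question, a plain GET with just a browser-like `User-Agent` header is usually enough. A minimal sketch (the User-Agent string here is an arbitrary example, not something this particular server is known to require):

```python
import urllib.request

url = "https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/"
# No data argument, so this is a GET; only the User-Agent header is added.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})
with urllib.request.urlopen(req) as response:
    page = response.read()
```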
There are many reasons people don't want scripts scraping their sites. It eats up their bandwidth. They don't want people profiting (money-wise) by building scraping bots. Maybe they don't want you copying their site's information. You can also think of it like a book: the author wants people to read it, but some authors may not want a robot scanning the book, creating a copy of it, or summarizing it.
The second part of your question, raised in the comments, is too vague and broad to answer here, as there are too many opinionated answers.
I tried this code and it works fine. I just added headers to the request; see the example below:
from urllib.request import Request, urlopen
from urllib.error import HTTPError  # HTTPError is defined in urllib.error
from time import sleep

def get_url_data(url=""):
    try:
        # Send a browser-like User-Agent so the server does not reject the request
        request = Request(url, headers={'User-Agent':
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"})
        response = urlopen(request)
        data = response.read().decode("utf8")
        return data
    except HTTPError:
        return None

url = "https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/"
for i in range(50):
    d = get_url_data(url)
    if d is not None:
        print("Attempt %d was a Success" % i)
    else:
        print("Attempt %d was a Failure" % i)
    sleep(1)
Output:
Attempt 0 was a Success
Attempt 1 was a Success
Attempt 2 was a Success
Attempt 3 was a Success
Attempt 4 was a Success
Attempt 5 was a Success
Attempt 6 was a Success
Attempt 7 was a Success
Attempt 8 was a Success
Attempt 9 was a Success
...
Attempt 42 was a Success
Attempt 43 was a Success
Attempt 44 was a Success
Attempt 45 was a Success
Attempt 46 was a Success
Attempt 47 was a Success
Attempt 48 was a Success
Attempt 49 was a Success
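One caveat about the loop above: it only catches `HTTPError`, so a transient network problem (DNS failure, refused connection, timeout) would raise `URLError` and crash the script. A sketch of a fetch helper that handles both, under the same assumptions as the code above:

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def fetch(url, user_agent="Mozilla/5.0 (X11; Linux x86_64)"):
    """Return the decoded body, or None on any HTTP or network error."""
    req = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.read().decode("utf-8")
    except HTTPError as e:
        # The server answered, but with an error status (e.g. 403, 500)
        print("HTTP error:", e.code)
    except URLError as e:
        # No usable answer at all: DNS failure, refused connection, timeout
        print("network error:", e.reason)
    return None
```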