Strange HTTP 410 Gone with Python urllib, not reproducible with wget
I'm using urllib in Python 3 to fetch some images from my server:
import urllib.request
import urllib.error

try:
    resp = urllib.request.urlopen(url)
except urllib.error.HTTPError as err:
    print("code " + str(err.status) + " reason " + err.reason)
Running the file prints an HTTP 410 Gone error:
$ python3.6 file.py
download: http://some_url.com/image.jpg
code 410 reason Gone
Traceback (most recent call last):
  File "file.py", line 32, in <module>
    image = image_from_url(url)
But I'm sure the image is there, because wget retrieves it just fine:
$ wget http://some_url.com/image.jpg
--2019-10-11 16:24:05-- http://some_url.com/image.jpg
Resolving some_url.com...
Connecting to some_url.com|...|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 127891 (125K) [image/jpeg]
Saving to: 'image.jpg'
Any ideas about what could be causing this? Something on the server side? Should the urllib request include some specific header?
Thanks
urllib request:
GET /wikipedia/commons/c/c9/Moon.jpg HTTP/1.1
Accept-Encoding: identity
Host: upload.wikimedia.org
User-Agent: Python-urllib/3.6
Connection: close
wget request:
GET /wikipedia/commons/c/c9/Moon.jpg HTTP/1.1
User-Agent: Wget/1.19.4 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: upload.wikimedia.org
Connection: Keep-Alive
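One way to confirm which headers urllib sends is to point it at a throwaway local server that echoes the request headers back. The following is only a sketch: `HeaderEchoHandler` and `headers_sent_by_urllib` are hypothetical names made up for illustration, and the server runs on 127.0.0.1 rather than on the real host.

```python
import http.server
import threading
import urllib.request


class HeaderEchoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler: replies with the request headers it received."""

    def do_GET(self):
        body = "".join(f"{k}: {v}\n" for k, v in self.headers.items()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request log line


def headers_sent_by_urllib(extra_headers=None):
    """Return the header lines urllib sends, as seen by a local echo server."""
    # Port 0 lets the OS pick a free port.
    server = http.server.HTTPServer(("127.0.0.1", 0), HeaderEchoHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        req = urllib.request.Request(url, headers=extra_headers or {})
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(req) as resp:
            return resp.read().decode()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(headers_sent_by_urllib())
```

Calling `headers_sent_by_urllib({"Accept": "*/*"})` shows the same request with the extra header added, which makes it easy to compare against the wget capture.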
Try adding an Accept: */* header? Some research suggests that filtering out requests that lack this header is a common practice, because they usually come from bots.
req = urllib.request.Request('some_url', headers={'Accept': '*/*'})
resp = urllib.request.urlopen(req)
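A slightly fuller sketch of the same idea, combined with the error handling from the question, might look like the code below. The captured requests also differ in their User-Agent, so this version sets that header as well. Whether the server filters on it is an assumption, not something the captures prove. `build_request`, `fetch`, and the `my-image-fetcher/0.1` string are hypothetical names chosen for illustration.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="my-image-fetcher/0.1"):
    """Build a request that carries the headers urllib omits by default."""
    # Assumption: the server rejects requests without Accept and/or with
    # the default Python-urllib User-Agent. The user_agent value is a
    # made-up placeholder.
    headers = {"Accept": "*/*", "User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


def fetch(url):
    """Return the response body as bytes, or None on an HTTP error status."""
    try:
        with urllib.request.urlopen(build_request(url)) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        print(f"code {err.code} reason {err.reason}")
        return None
```

If the 410 persists with both headers set, the cause is probably something other than header filtering, and the server logs would be the next place to look.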