Python Sockets Download jpg over HTTP
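I'm seeing very strange behavior in my Python script. I'm using Python sockets to download images from the web, and I'm not interested in using requests/urllib. When I try to download an image, the download succeeds. However, when I open the file in the Photos app, Windows returns an "It looks like we don't support this file format" error.
This is where the strange part begins. If I copy and paste the URL my socket is connecting to (the one used to download the image, in this case http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg) and download it myself from Chrome, then run my script again, the image downloads and displays just fine! The value of Content-Length in the HTTP response headers also increases. I did this 3 times with 3 different images, and each time I got the same behavior. Below are two runs of my script, one before I downloaded the file from Chrome and one after. Note that in the first run the Content-Length header states that the response body contains 2564 bytes. In the second run, this number changes to 3833. Both runs request the same URL.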
PS D:\Documents\School\RIT\Classes\Summer 2018\CSEC 380\Homework\Script> python .\hw3-script.py
MESSAGE SENT
GET /gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//xAbuaitah.jpg.pagespeed.ic.PFwk87Pcno.jpg HTTP/1.1
Host: www.rit.edu
Accept: image/webp,image/apng,image/*,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate
ENTIRE MESSAGE RECEIVED
b'HTTP/1.1 200 OK\r\nDate: Sun, 12 Aug 2018 04:58:24 GMT\r\nServer: Apache\r\nLink: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"\r\nAccept-Ranges: bytes\r\nLast-Modified: Sun, 12 Aug 2018 02:06:23 GMT\r\nX-Original-Content-Length: 25378\r\nX-Content-Type-Options: nosniff\r\nExpires: Sun, 12 Aug 2018 02:11:23 GMT\r\nCache-Control: max-age=300,private\r\nContent-Length: 2564\r\nConnection: close\r\nContent-Type: image/webp\r\n\r\nRIFF\xfc\t\...<hex data here>...\x00\x00'
RESPONSE HEADERS SPLIT OFF
HTTP/1.1 200 OK
Date: Sun, 12 Aug 2018 04:58:24 GMT
Server: Apache
Link: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"
Accept-Ranges: bytes
Last-Modified: Sun, 12 Aug 2018 02:06:23 GMT
X-Original-Content-Length: 25378
X-Content-Type-Options: nosniff
Expires: Sun, 12 Aug 2018 02:11:23 GMT
Cache-Control: max-age=300,private
Content-Length: 2564
Connection: close
Content-Type: image/webp
IMAGE BINARY DATA SPLIT OFF
b'RIFF\xfc\t\...<hex data here>...\x00\x00'
Bytes in image data: 2581
PS D:\Documents\School\RIT\Classes\Summer 2018\CSEC 380\Homework\Script> python .\hw3-script.py
MESSAGE SENT
GET /gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//xAbuaitah.jpg.pagespeed.ic.PFwk87Pcno.jpg HTTP/1.1
Host: www.rit.edu
Accept: image/webp,image/apng,image/*,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate
ENTIRE MESSAGE RECEIVED
b'HTTP/1.1 200 OK\r\nDate: Sun, 12 Aug 2018 04:59:08 GMT\r\nServer: Apache\r\nLink: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"\r\nX-Content-Type-Options: nosniff\r\nAccept-Ranges: bytes\r\nExpires: Mon, 12 Aug 2019 04:58:50 GMT\r\nCache-Control: max-age=31536000\r\nEtag: W/"0"\r\nLast-Modified: Sun, 12 Aug 2018 04:58:50 GMT\r\nX-Original-Content-Length: 25378\r\nContent-Length: 3833\r\nConnection: close\r\nContent-Type: image/jpeg\r\n\r\n\xff\xd8\...<hex data here>...\xff\xd9'
RESPONSE HEADERS SPLIT OFF
HTTP/1.1 200 OK
Date: Sun, 12 Aug 2018 04:59:08 GMT
Server: Apache
Link: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"
X-Content-Type-Options: nosniff
Accept-Ranges: bytes
Expires: Mon, 12 Aug 2019 04:58:50 GMT
Cache-Control: max-age=31536000
Etag: W/"0"
Last-Modified: Sun, 12 Aug 2018 04:58:50 GMT
X-Original-Content-Length: 25378
Content-Length: 3833
Connection: close
Content-Type: image/jpeg
IMAGE BINARY DATA SPLIT OFF
b'\xff\xd8\...<hex data here>...\xff\xd9'
Bytes in image data: 3850
Here is my code:
import os
import socket
import sys

# Raw string so that sequences such as "\3" and "\a" are not treated as escapes
IMAGE_DIR = r"D:\Documents\School\RIT\Classes\Summer 2018\CSEC 380\Homework\3\Script\act1step2images"


class MySocket:
    def __init__(self, sock=None):
        if sock is None:
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        else:
            self.sock = sock

    def connect(self, host, port):
        self.sock.connect((host, port))

    def myclose(self):
        self.sock.close()

    def mysend(self, msg, debug=False):
        if debug:
            print("MESSAGE SENT")
            print(msg.decode())
        self.sock.sendall(msg)

    def myreceive(self, debug=False):
        received = b''
        buffer = 1  # bytes requested per recv() call
        while True:
            part = self.sock.recv(buffer)
            received += part
            # recv() returns b'' once the server has closed the connection
            if part == b'':
                break
        if debug:
            print("Received...")
            print(received)
        return received


def download_image(img_url):
    """
    Download the image at the given url over a raw socket
    :param img_url: url corresponding to an image
    :return: None
    """
    image_socket = MySocket()
    image_socket.connect("www.rit.edu", 80)
    message = "GET " + img_url + " HTTP/1.1\r\n" \
              "Host: www.rit.edu\r\n" \
              "Accept: image/webp,image/apng,image/*,*/*;q=0.8\r\n" \
              "Accept-Language: en-US,en;q=0.9\r\n" \
              "Accept-Encoding: gzip, deflate\r\n\r\n"
    image_socket.mysend(message.encode(), True)
    reply = image_socket.myreceive()
    image_socket.myclose()
    print("ENTIRE MESSAGE RECEIVED")
    print(reply)
    print()
    # The headers end at the first blank line (CRLF CRLF)
    headers = reply.split(b'\r\n\r\n')[0]
    print("RESPONSE HEADERS SPLIT OFF")
    print(headers.decode())
    image = reply[len(headers)+4:]
    print()
    print("IMAGE BINARY DATA SPLIT OFF")
    print(image)
    print()
    # sys.getsizeof() includes the bytes object's own overhead;
    # len(image) would give the number of payload bytes
    print("Bytes in image data:", sys.getsizeof(image))
    print()
    # print(type(image))
    # File name: running count of files in the directory plus the url's last 4 characters
    img_name = str(len(os.listdir(IMAGE_DIR))) + img_url[-4:]
    with open(os.path.join(IMAGE_DIR, img_name), 'wb') as f:
        f.write(image)


def main():
    download_image("http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg")


main()
Can anyone tell me what's going on here, and why the jpg doesn't download on the first attempt?
This is part of the request you send:
Accept: image/webp,image/apng,image/*,*/*;q=0.8
It tells the server that you accept image/webp and lists it ahead of any other image/* type. As a result, you get a WEBP image in the response:
HTTP/1.1 200 OK
...
Content-Length: 2564
...
Content-Type: image/webp
...
b'RIFF\xfc\t\...<hex data here>...\x00\x00'
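To make the point about the Accept header concrete, here is a minimal sketch that splits the header value into media types and their q-values. `parse_accept` is a hypothetical helper written for this illustration, not part of any library, and it deliberately ignores edge cases such as quoted parameters. It shows that image/webp is advertised with the default weight of 1.0, the same as image/*, so the server is free to pick it.

```python
def parse_accept(header_value):
    """Return (media_type, q) pairs, highest q first; a missing q defaults to 1.0."""
    entries = []
    for item in header_value.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0]
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


accept = "image/webp,image/apng,image/*,*/*;q=0.8"
for media_type, q in parse_accept(accept):
    print(media_type, q)
```

Only the final `*/*` entry carries a lower weight (0.8); everything before it is equally acceptable as far as the header is concerned.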
The next time you send the same request, you get a different response:
HTTP/1.1 200 OK
...
Content-Length: 3833
...
Content-Type: image/jpeg
...
b'\xff\xd8\...<hex data here>...\xff\xd9'
This time you get a JPEG image instead of a WEBP image, as can be seen in both the Content-Type header and the response body.
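The difference in the body is visible in its first bytes: the first response starts with `RIFF`, the second with `\xff\xd8`. A small sketch of how a script could check this before saving the file follows. `sniff_image_type` is a hypothetical helper name, and it only knows the three signatures listed in it.

```python
def sniff_image_type(body):
    """Guess the image format from the leading bytes of the response body."""
    # JPEG: start-of-image marker FF D8, followed by the next marker's FF
    if body[:3] == b"\xff\xd8\xff":
        return "image/jpeg"
    # WebP: RIFF container, 4-byte size, then the form type "WEBP"
    if body[:4] == b"RIFF" and body[8:12] == b"WEBP":
        return "image/webp"
    # PNG: fixed 8-byte signature
    if body[:8] == b"\x89PNG\r\n\x1a\n":
        return "image/png"
    return None


print(sniff_image_type(b"RIFF\xfc\t\x00\x00WEBPVP8 "))
print(sniff_image_type(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
```

A script could use the result (or the Content-Type header) to choose the file extension, rather than taking the last four characters of the URL as the question's code does.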
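I'm not entirely sure why this happens, but my assumption is that the earlier request from Chrome caused the server to create a JPEG image from the original source file and cache it locally for later requests, so that it is now cheaper for the server to serve the pre-created JPEG file than to create a new WEBP file. And your Accept header says that you support both formats.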
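In any case, if your code does not support WEBP but only JPEG, you should not claim in your Accept header that you can handle WEBP. Instead, you should declare only what you really support, i.e.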
Accept: image/jpeg
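The same is true of the other information you send in the request. For example, by sending Accept-Encoding: gzip, deflate you claim to support compressed responses, but your code has no support for handling them. Similarly, by sending an HTTP/1.1 request you claim to be able to handle chunked transfer encoding and HTTP keep-alive, but your code does not support either of these features.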
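For a sense of what supporting these features would involve, here is a rough sketch of the two decoding steps a client has to be prepared for if it advertises them. Both `dechunk` and `decode_content` are hypothetical helper names; the sketch ignores chunk trailers and assumes that "deflate" means zlib-wrapped data, which not every server honors.

```python
import gzip
import zlib


def dechunk(data):
    """Decode a body that was sent with Transfer-Encoding: chunked."""
    body = b""
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # The size line is hexadecimal and may carry ";extension" parameters
        size = int(data[pos:line_end].split(b";")[0], 16)
        if size == 0:
            break  # last chunk; any trailers are ignored in this sketch
        start = line_end + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the CRLF that terminates the chunk
    return body


def decode_content(body, content_encoding):
    """Undo Content-Encoding: gzip or deflate; pass anything else through."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    if content_encoding == "deflate":
        # Assumption: zlib-wrapped deflate data
        return zlib.decompress(body)
    return body


print(dechunk(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"))
```

If the script does not implement steps like these, the simpler option is not to advertise the features in the first place.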
In summary, this is all you should send to get what you want:
GET /.... HTTP/1.0
Host: www.rit.edu
Accept: image/jpeg
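As a rough sketch of how the script could send exactly this request, the code below builds the three-line HTTP/1.0 request and splits the reply into status line, headers and body. The names `build_request`, `split_response` and `fetch` are made up for this example, and the request path is left for the caller to supply. Reading until the server closes the connection is what the question's code already does, and it matches the `Connection: close` header in the responses shown.

```python
import socket


def build_request(host, path):
    """Build an HTTP/1.0 GET that advertises only what the client can handle."""
    lines = [
        "GET " + path + " HTTP/1.0",
        "Host: " + host,
        "Accept: image/jpeg",
        "",  # blank line that ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw):
    """Split a raw response into (status_line, headers dict, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


def fetch(host, path, port=80):
    """Send the request and read until the server closes the connection."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            part = sock.recv(4096)
            if not part:
                break
            chunks.append(part)
    return split_response(b"".join(chunks))
```

A caller would then check that `headers["content-type"]` is `image/jpeg` before writing `body` to disk.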