Python sockets: downloading a JPG over HTTP

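I'm seeing very strange behavior in a Python script. I'm using Python sockets to download an image from the web, and I'm not interested in using requests/urllib. When I try to download the image, the download succeeds. However, when I open the file in the Photos app, Windows returns an "It looks like we don't support this file format" error.

This is where the strange part starts. If I copy and paste the URL my socket connects to (the one used to download the image, in this case http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg), download it myself from Chrome, and then run my script again, the image downloads and displays fine! The value of Content-Length in the HTTP response headers also increases. I did this three times with three different images, and I got the same behavior every time. Below are two runs of my script, one before I downloaded the file from Chrome and one after. Note that in the first run the Content-Length header says the response body contains 2564 bytes. In the second run, that number changes to 3833. Both runs request the same URL.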

PS D:\Documents\School\RIT\Classes\Summer 2018\CSEC 380\Homework\Script> python .\hw3-script.py
MESSAGE SENT
GET /gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//xAbuaitah.jpg.pagespeed.ic.PFwk87Pcno.jpg HTTP/1.1
Host: www.rit.edu
Accept: image/webp,image/apng,image/*,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate


ENTIRE MESSAGE RECEIVED
b'HTTP/1.1 200 OK\r\nDate: Sun, 12 Aug 2018 04:58:24 GMT\r\nServer: Apache\r\nLink: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"\r\nAccept-Ranges: bytes\r\nLast-Modified: Sun, 12 Aug 2018 02:06:23 GMT\r\nX-Original-Content-Length: 25378\r\nX-Content-Type-Options: nosniff\r\nExpires: Sun, 12 Aug 2018 02:11:23 GMT\r\nCache-Control: max-age=300,private\r\nContent-Length: 2564\r\nConnection: close\r\nContent-Type: image/webp\r\n\r\nRIFF\xfc\t\...<hex data here>...\x00\x00'

RESPONSE HEADERS SPLIT OFF
HTTP/1.1 200 OK
Date: Sun, 12 Aug 2018 04:58:24 GMT
Server: Apache
Link: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"
Accept-Ranges: bytes
Last-Modified: Sun, 12 Aug 2018 02:06:23 GMT
X-Original-Content-Length: 25378
X-Content-Type-Options: nosniff
Expires: Sun, 12 Aug 2018 02:11:23 GMT
Cache-Control: max-age=300,private
Content-Length: 2564
Connection: close
Content-Type: image/webp

IMAGE BINARY DATA SPLIT OFF
b'RIFF\xfc\t\...<hex data here>...\x00\x00'

Bytes in image data: 2581

PS D:\Documents\School\RIT\Classes\Summer 2018\CSEC 380\Homework\Script> python .\hw3-script.py
MESSAGE SENT
GET /gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//xAbuaitah.jpg.pagespeed.ic.PFwk87Pcno.jpg HTTP/1.1
Host: www.rit.edu
Accept: image/webp,image/apng,image/*,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate


ENTIRE MESSAGE RECEIVED
b'HTTP/1.1 200 OK\r\nDate: Sun, 12 Aug 2018 04:59:08 GMT\r\nServer: Apache\r\nLink: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"\r\nX-Content-Type-Options: nosniff\r\nAccept-Ranges: bytes\r\nExpires: Mon, 12 Aug 2019 04:58:50 GMT\r\nCache-Control: max-age=31536000\r\nEtag: W/"0"\r\nLast-Modified: Sun, 12 Aug 2018 04:58:50 GMT\r\nX-Original-Content-Length: 25378\r\nContent-Length: 3833\r\nConnection: close\r\nContent-Type: image/jpeg\r\n\r\n\xff\xd8\...<hex data here>...\xff\xd9'

RESPONSE HEADERS SPLIT OFF
HTTP/1.1 200 OK
Date: Sun, 12 Aug 2018 04:59:08 GMT
Server: Apache
Link: <http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg>; rel="canonical"
X-Content-Type-Options: nosniff
Accept-Ranges: bytes
Expires: Mon, 12 Aug 2019 04:58:50 GMT
Cache-Control: max-age=31536000
Etag: W/"0"
Last-Modified: Sun, 12 Aug 2018 04:58:50 GMT
X-Original-Content-Length: 25378
Content-Length: 3833
Connection: close
Content-Type: image/jpeg

IMAGE BINARY DATA SPLIT OFF
b'\xff\xd8\...<hex data here>...\xff\xd9'

Bytes in image data: 3850

Here is my code:

import os
import socket
import sys

class MySocket:

    def __init__(self, sock=None):
        if sock is None:
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        else:
            self.sock = sock

    def connect(self, host, port):
        self.sock.connect((host, port))

    def myclose(self):
        self.sock.close()

    def mysend(self, msg, debug=False):
        if debug:
            print("MESSAGE SENT")
            print(msg.decode())
        self.sock.sendall(msg)

    def myreceive(self, debug=False):
        received = b''
        buffer = 4096  # read up to 4 KiB per recv() call instead of one byte at a time
        while True:
            part = self.sock.recv(buffer)
            received += part
            if part == b'':
                break
        if debug:
            print("Received...")
            print(received)
        return received

def download_image(img_url):
    """
    Download the image at the given url over a raw socket and save it to disk
    :param img_url: url corresponding to an image
    :return: None
    """
    image_socket = MySocket()
    image_socket.connect("www.rit.edu", 80)
    message = "GET " + img_url + " HTTP/1.1\r\n" \
              "Host: www.rit.edu\r\n" \
              "Accept: image/webp,image/apng,image/*,*/*;q=0.8\r\n" \
              "Accept-Language: en-US,en;q=0.9\r\n" \
              "Accept-Encoding: gzip, deflate\r\n\r\n"
    image_socket.mysend(message.encode(), True)
    reply = image_socket.myreceive()
    print("ENTIRE MESSAGE RECEIVED")
    print(reply)
    print()
    headers = reply.split(b'\r\n\r\n')[0]

    print("RESPONSE HEADERS SPLIT OFF")
    print(headers.decode())
    image = reply[len(headers)+4:]
    print()

    print("IMAGE BINARY DATA SPLIT OFF")
    print(image)
    print()
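    # Note: sys.getsizeof() reports the size of the bytes object including
    # interpreter overhead; len(image) would give the payload length.
    print("Bytes in image data:", sys.getsizeof(image))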
    print()
    # print(type(image))
    # Raw string keeps the backslashes literal ("\3" would otherwise be read as an octal escape).
    out_dir = r"D:\Documents\School\RIT\Classes\Summer 2018\CSEC 380\Homework\3\Script\act1step2images"
    img_name = str(len(os.listdir(out_dir))) + img_url[-4:]
    with open(os.path.join(out_dir, img_name), 'wb') as f:
        f.write(image)

def main():
    download_image("http://www.rit.edu/gccis/computingsecurity/sites/rit.edu.gccis.computingsecurity/files//Abuaitah.jpg")

main()

Can anyone tell me what's going on here, and why the jpg doesn't download on the first attempt?

This is part of the request you send:

Accept: image/webp,image/apng,image/*,*/*;q=0.8

It tells the server that you would prefer a response with the image/webp content type over any other image/* type. That is why you get a WebP image in the response:

HTTP/1.1 200 OK
...
Content-Length: 2564
...
Content-Type: image/webp
...
b'RIFF\xfc\t\...<hex data here>...\x00\x00'
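To make the meaning of that Accept header concrete, here is a minimal sketch of how a server might break it into media ranges and quality values. The function name `parse_accept` is hypothetical, and real servers apply more rules (specificity of media ranges, extra parameters) than this simplified version does.

```python
def parse_accept(header_value):
    """Return (media_range, q) pairs sorted by descending q.

    Simplified: only the q parameter is interpreted, and entries with
    the same q keep the order in which they appear in the header.
    """
    entries = []
    for item in header_value.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0]
        q = 1.0  # a media range without a q parameter defaults to 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        entries.append((media_range, q))
    # sorted() is stable, so equal-q entries stay in header order
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


prefs = parse_accept("image/webp,image/apng,image/*,*/*;q=0.8")
for media_range, q in prefs:
    print(media_range, q)
```

Only `*/*` carries a reduced quality value here, so image/webp is advertised as fully acceptable and the server is free to pick it.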

The next time you send the same request, you get a different response:

HTTP/1.1 200 OK
...
Content-Length: 3833
...
Content-Type: image/jpeg
...
b'\xff\xd8\...<hex data here>...\xff\xd9'

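This time you get a JPEG image rather than a WebP image, as both the Content-Type header and the response body show.

I'm not entirely sure why this happens, but my assumption is that the earlier request from Chrome caused the server to create a JPEG image from the original source file and cache it locally for later requests, so that it is now cheaper for the server to serve the pre-created JPEG file than to create a new WebP file. And your Accept header says that you support both formats.

In any case, if your code only supports JPEG and not WebP, you should not claim in your Accept header to be able to handle WebP. Instead, declare only what you really support, i.e.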
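The two bodies can also be told apart from their leading bytes alone, independent of the Content-Type header. The sketch below checks the well-known signatures (a JPEG stream starts with `FF D8 FF`; a WebP file is a RIFF container whose bytes 8 to 11 read `WEBP`). The helper names `sniff_image_type` and `extension_for` are made up for illustration.

```python
def sniff_image_type(data):
    """Guess the image type from the first bytes of the body, or return None."""
    if data[:3] == b"\xff\xd8\xff":  # JPEG start-of-image marker
        return "image/jpeg"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":  # RIFF container holding WebP
        return "image/webp"
    return None


def extension_for(data, default=".bin"):
    """Pick a file extension that matches the actual content."""
    known = {"image/jpeg": ".jpg", "image/webp": ".webp"}
    return known.get(sniff_image_type(data), default)


print(extension_for(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
print(extension_for(b"RIFF\xfc\t\x00\x00WEBPVP8 "))
```

Saving the first response with a `.jpg` name, as the script in the question does, produces a WebP file with a JPEG extension, which would explain why the Photos app rejects it.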

Accept: image/jpeg

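The same goes for the other information you send in the request. For example, by sending Accept-Encoding: gzip, deflate you claim to support compressed responses, but your code cannot handle them. Similarly, by sending an HTTP/1.1 request you claim to be able to handle chunked transfer encoding and HTTP keep-alive, but your code does not support any of these features either.

In summary, this is all you should send to get what you want: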
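For comparison, this is roughly what a client would have to do to honor those claims. It is a simplified sketch, not a complete HTTP implementation: the helper names (`split_response`, `dechunk`, `decode_body`) are hypothetical, chunk trailers are ignored, and "deflate" is assumed to be zlib-wrapped, which not every server follows.

```python
import gzip
import zlib


def split_response(raw):
    """Split a raw HTTP response into status line, header dict, and body bytes."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, body


def dechunk(body):
    """Reassemble a body sent with Transfer-Encoding: chunked."""
    out = b""
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The chunk size is hexadecimal; anything after ';' is a chunk extension.
        size = int(body[pos:line_end].split(b";")[0], 16)
        if size == 0:
            break  # last chunk reached (trailers, if any, are ignored)
        start = line_end + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF
    return out


def decode_body(headers, body):
    """Undo chunking first, then content compression."""
    if "chunked" in headers.get("transfer-encoding", "").lower():
        body = dechunk(body)
    encoding = headers.get("content-encoding", "").lower()
    if encoding == "gzip":
        body = gzip.decompress(body)
    elif encoding == "deflate":
        body = zlib.decompress(body)  # assumes zlib-wrapped deflate
    return body


# Synthetic response used only to exercise the helpers; not real server output.
payload = b"\xff\xd8\xff example payload"
compressed = gzip.compress(payload)
raw = (b"HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\nContent-Encoding: gzip\r\n\r\n"
       + format(len(compressed), "x").encode() + b"\r\n" + compressed + b"\r\n0\r\n\r\n")
status, headers, body = split_response(raw)
print(decode_body(headers, body) == payload)
```

If none of this is needed, it is simpler not to advertise these capabilities in the first place.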

GET /.... HTTP/1.0
Host: www.rit.edu
Accept: image/jpeg
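Put together, a client that sends only this request could look like the sketch below. The names `build_request` and `fetch` are illustrative and not part of the code in the question. With HTTP/1.0 and no keep-alive, the server is expected to close the connection after the response, so reading until `recv()` returns an empty bytes object is sufficient.

```python
import socket


def build_request(host, path):
    """Build a minimal HTTP/1.0 request that only claims JPEG support."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "Accept: image/jpeg",
        "",  # blank line that terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host, path, port=80, timeout=10):
    """Send the request and return (raw_headers, body) as bytes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            part = sock.recv(4096)
            if not part:  # empty read: the server closed the connection
                break
            chunks.append(part)
    head, _, body = b"".join(chunks).partition(b"\r\n\r\n")
    return head, body


print(build_request("www.rit.edu", "/example.jpg").decode("ascii"))
```

Calling `fetch("www.rit.edu", path)` with the image path from the question would then return the headers and the body separately. It is still worth checking the returned Content-Type before writing the body to a `.jpg` file.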