Uncompressed size of a webpage using chunked transfer encoding and gzip compression
I'm writing an application that calculates how much we save by using gzip on our web pages. When a user enters the URL of a page served with gzip, the application should spit out the size savings due to gzip.
How should I approach this?
This is what I get from a GET request for the page's headers:
{
'X-Powered-By': 'PHP/5.5.9-1ubuntu4.19',
'Transfer-Encoding': 'chunked',
'Content-Encoding': 'gzip',
'Vary': 'Accept-Encoding',
'Server': 'nginx/1.4.6 (Ubuntu)',
'Connection': 'keep-alive',
'Date': 'Thu, 10 Nov 2016 09:49:58 GMT',
'Content-Type': 'text/html'
}
I'm retrieving the page with requests:
r = requests.get(url, headers=headers)
data = r.text
print("Webpage size : ", len(data) / 1024)
Send HEAD requests with and without accepting gzip compression, and compare the resulting Content-Length headers.
The Accept-Encoding header lets you request gzip compression:
'Accept-Encoding': 'gzip'
In this case, the request asks for no gzip encoding:
'Accept-Encoding': ''
Sending HEAD requests is easily handled by the requests library:
import requests
r = requests.head("http://whosebug.com/", headers={'Accept-Encoding': 'gzip'})
print(r.headers['content-length'])
41450
r = requests.head("http://whosebug.com/", headers={'Accept-Encoding': ''})
print(r.headers['content-length'])
250243
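With those two Content-Length values in hand, the savings figure the application should report is simple arithmetic. A minimal sketch, reusing the example numbers from the two HEAD requests above:

```python
# Example values taken from the two HEAD requests above
compressed_length = 41450
uncompressed_length = 250243

saved_bytes = uncompressed_length - compressed_length
saved_percent = 100 * saved_bytes / uncompressed_length
print('Saved {} bytes ({:.1f}%)'.format(saved_bytes, saved_percent))
# Saved 208793 bytes (83.4%)
```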
If you have already downloaded the URL (using a requests GET request without the stream option), you already have both sizes, because the whole response is downloaded and decompressed for you, while the original compressed length is available in the headers:
from __future__ import division
r = requests.get(url, headers=headers)
compressed_length = int(r.headers['content-length'])
decompressed_length = len(r.content)
ratio = compressed_length / decompressed_length
You could compare the content-length header of a request sent with Accept-Encoding: identity against one sent with Accept-Encoding: gzip:
no_gzip = {'Accept-Encoding': 'identity'}
no_gzip.update(headers)
uncompressed_length = int(requests.get(url, headers=no_gzip).headers['content-length'])
force_gzip = {'Accept-Encoding': 'gzip'}
force_gzip.update(headers)
compressed_length = int(requests.get(url, headers=force_gzip).headers['content-length'])
However, this may not work for all servers, as servers of dynamically-generated content routinely futz the Content-Length header in that case to avoid having to render the content first.
If you are requesting a resource served with chunked transfer encoding, there will be no content-length header, in which case a HEAD request may or may not give you the right information.
In that case you have to stream the whole response and extract the decompressed size from the end of the stream (the GZIP format stores it as a little-endian 4-byte unsigned int at the very end). Use the stream() method on the raw urllib3 response object:
import requests
from collections import deque

if hasattr(int, 'from_bytes'):
    # Python 3.2 and up
    _extract_size = lambda q: int.from_bytes(bytes(q), 'little')
else:
    import struct
    _le_int = struct.Struct('<I').unpack
    _extract_size = lambda q: _le_int(b''.join(q))[0]

def get_content_lengths(url, headers=None, chunk_size=2048):
    """Return the compressed and uncompressed lengths for a given URL

    Works for all resources accessible by GET, regardless of transfer-encoding
    and discrepancies between HEAD and GET responses. This does have
    to download the full request (streamed) to determine sizes.
    """
    only_gzip = {'Accept-Encoding': 'gzip'}
    only_gzip.update(headers or {})
    # Set `stream=True` to ensure we can access the original stream:
    r = requests.get(url, headers=only_gzip, stream=True)
    r.raise_for_status()
    if r.headers.get('Content-Encoding') != 'gzip':
        raise ValueError('Response not gzip-compressed')
    # we only need the very last 4 bytes of the data stream
    last_data = deque(maxlen=4)
    compressed_length = 0
    # stream directly from the urllib3 response so we can ensure the
    # data is not decompressed as we iterate
    for chunk in r.raw.stream(chunk_size, decode_content=False):
        compressed_length += len(chunk)
        last_data.extend(chunk)
    if compressed_length < 4:
        raise ValueError('Not enough data loaded to determine uncompressed size')
    return compressed_length, _extract_size(last_data)
Demo:
>>> compressed_length, decompressed_length = get_content_lengths('http://httpbin.org/gzip')
>>> compressed_length
179
>>> decompressed_length
226
>>> compressed_length / decompressed_length
0.7920353982300885
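The two tricks the function relies on can be verified offline with the standard-library gzip module. A self-contained sketch (the sample data is made up):

```python
import gzip
import struct
from collections import deque

original = b'hello world' * 100  # 1100 bytes of compressible data
blob = gzip.compress(original)

# The last 4 bytes of a gzip stream are ISIZE: the length of the
# uncompressed input modulo 2**32, as a little-endian unsigned int.
(isize,) = struct.unpack('<I', blob[-4:])
print(isize)  # 1100

# deque(maxlen=4) retains just the final 4 bytes while streaming chunks,
# which is how get_content_lengths() avoids buffering the whole body.
last4 = deque(maxlen=4)
for i in range(0, len(blob), 7):  # feed arbitrary-sized chunks
    last4.extend(blob[i:i + 7])
assert bytes(last4) == blob[-4:]
```

Note that ISIZE wraps around at 4 GiB, so the reported uncompressed size is only exact for resources smaller than 2**32 bytes.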