Streaming in a gzipped file from s3 in python

Question

Hello, I'm working on a project that uses Common Crawl data, and I have a subset of the latest crawl's WARC file paths from here.

So basically I have a URL like https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-45/segments/1603107863364.0/warc/CC-MAIN-20201019145901-20201019175901-00000.warc.gz (the first URL in the WARC paths), and I am streaming it in with requests like so:

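import zlib

import requests

url = "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-45/segments/1603107863364.0/warc/CC-MAIN-20201019145901-20201019175901-00000.warc.gz"
headers = {}  # placeholder: the headers actually used are not shown in the question

def stream_gzip_decompress(stream):
    # 32 + zlib.MAX_WBITS makes zlib auto-detect the gzip/zlib header
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv

s = requests.Session()
resp = s.get(url, headers=headers, stream=True)
print(resp.status_code)
for line in stream_gzip_decompress(resp):
    print(line.decode('utf-8'))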

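stream_gzip_decompress is taken from "Python unzipping stream of bytes?"

The first three chunks seem to decompress fine and print out, then the script just hangs forever (I only waited about 8 minutes). It still seems to be running through the chunks, but rv is always empty, so the if rv: check never passes and nothing is yielded, even though bytes still appear to be streaming in.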

Answer 1

Why not use a WARC parser library (I'd recommend warcio) to do the parsing, including the gzip decompression?

Alternatively, have a look at gzipstream, which reads from a stream of gzipped content and decompresses the data on the fly.
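For context on why the original generator goes quiet: Common Crawl WARC files are, as far as I know, a concatenation of many gzip members (one per record), and a single zlib.decompressobj stops at the end of the first member and pushes everything after it into unused_data. The following is a minimal standard-library sketch that restarts the decompressor at each member boundary. It is not gzipstream's actual implementation, and it assumes the stream contains nothing but back-to-back gzip members (no trailing padding).

```python
import zlib


def stream_gzip_decompress(chunks):
    """Decompress an iterable of byte chunks holding one or more gzip members."""
    # 32 + MAX_WBITS: let zlib auto-detect a gzip or zlib header.
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = chunk
        while data:
            out = dec.decompress(data)
            if out:
                yield out
            if dec.eof:
                # The current member ended inside this chunk; whatever
                # is left over belongs to the next member.
                data = dec.unused_data
                dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
            else:
                # All input was consumed; wait for the next chunk.
                data = b""
```

With requests, the chunks could come from resp.iter_content(chunk_size=64 * 1024), which avoids the very small chunks produced by iterating over the response object directly. Note that the yielded pieces are arbitrary byte blocks, not lines, so decoding each one as UTF-8 separately can fail when a multi-byte character is split across two blocks.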

python

gzip

zlib

common-crawl