Reading lines from a gzipped text file in Python and getting the number of compressed bytes read

Question

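I have many gzipped text files that I want to decompress and process on the fly. That way I save disk space and read less data from disk, at the cost of the time spent decompressing while reading.

So I use the gzip module, together with tqdm to track progress.

But how can I find the size of the original uncompressed file, so that I can set the total number of (uncompressed) bytes up front and track progress against it? As far as I could find online, this is hard to determine for gzip files larger than 4 GB, which is my case.

Or should I instead track the number of compressed bytes read and set the total to the size of the compressed file?

How can I do that?

Below is a code sample; its comments also reflect what I am trying to achieve.

I am using Python 3.5.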

import gzip
import tqdm
import os

size = os.path.getsize('filename.gz')
pbar = tqdm.tqdm(total=size, unit='b', unit_scale=True, unit_divisor=1024)

with gzip.open('filename.gz', 'rt') as file:
    for line in file:
        bytes_uncompressed = len(line.encode('utf-8'))
        # but how can I get compressed bytes read count?
        # bytes_compressed = ...?

        # pbar.update(bytes_compressed)

Answer 1

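You should open the underlying file for reading in binary mode: f = open('filename.gz', 'rb'). Then open a gzip file on top of it: g = gzip.GzipFile(fileobj=f). You perform your reads from g, and to find out how far along you are, you can call f.tell() to get the position in the compressed file.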
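
A minimal sketch of that idea follows. The helper name iter_lines_with_compressed_pos is made up for illustration, and UTF-8 text is assumed. Note that GzipFile reads from f in chunks, so the reported offset advances in jumps rather than line by line.

```python
import gzip


def iter_lines_with_compressed_pos(path):
    """Yield (line, compressed_offset) pairs for a gzip text file.

    compressed_offset is f.tell() on the underlying binary file, i.e. how
    many compressed bytes have been consumed from disk so far.
    (Hypothetical helper; assumes the text is UTF-8.)
    """
    with open(path, 'rb') as f:                 # underlying compressed file
        with gzip.GzipFile(fileobj=f) as g:     # decompressing view on top of f
            for raw in g:                       # lines come back as bytes
                yield raw.decode('utf-8'), f.tell()
```

Each offset can be compared against os.path.getsize(path) to obtain a completion fraction.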

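Edit 2: By the way, you can of course also call tell() on the GzipFile instance to see how far along you are in the uncompressed data (the number of uncompressed bytes read).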
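
To see the two positions side by side, something like the following should work. The function name is an assumption for illustration; g.tell() reports the uncompressed offset and f.tell() the compressed one.

```python
import gzip


def uncompressed_and_compressed_positions(path):
    """Return a list of (uncompressed_pos, compressed_pos), one per line read.

    Hypothetical helper: g.tell() counts decompressed bytes handed out so far,
    while f.tell() counts compressed bytes pulled from the underlying file.
    """
    positions = []
    with open(path, 'rb') as f, gzip.GzipFile(fileobj=f) as g:
        for _ in g:
            positions.append((g.tell(), f.tell()))
    return positions
```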

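Edit: I now see that this is only a partial answer to your question. You also need the total. I'm afraid you are a bit out of luck there, especially, as you said, for files over 4 GB. gzip keeps the uncompressed size in the last four bytes of the file, so you could seek there, read them, and seek back (GzipFile itself does not seem to expose this information). But because the field is only four bytes, the largest size it can represent is 4 GB; anything larger is truncated to the low four bytes of the value. In that case, I'm afraid you won't know the total until you reach the end.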
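
For reference, reading that trailer field could look like the sketch below. The name gzip_isize is hypothetical, and the code assumes a single-member gzip file; the value is the uncompressed size modulo 2**32, so it is only trustworthy when the real size is known to be under 4 GiB.

```python
import struct


def gzip_isize(path):
    """Return the 4-byte size field from the end of a gzip file.

    This is the uncompressed size modulo 2**32, stored little-endian.
    (Hypothetical helper; assumes the file holds a single gzip member.)
    """
    with open(path, 'rb') as f:
        f.seek(-4, 2)                              # 2 == os.SEEK_END
        (isize,) = struct.unpack('<I', f.read(4))  # unsigned 32-bit, little-endian
    return isize
```

As an example of the wrap-around, a 5 GiB file would report 1 GiB here, since (5 * 2**30) % 2**32 == 2**30.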

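In any case, the hint above gives you the current position in both the compressed and the uncompressed stream. I hope that lets you achieve at least part of what you want.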

Answer 2

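Your question already contains the answer. Don't track progress in uncompressed bytes; track it in compressed bytes. For a file whose content is reasonably uniform, the two are roughly proportional, so you get the same effect. And the size of the compressed file is easy to find.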

Answer 3

Here is what I did:

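import gzip

from tqdm import tqdm  # import the class, not just the module, so tqdm(...) is callable


def _reader_generator(reader):
    # Yield successive 1 MiB chunks until the reader returns an empty result.
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)

def raw_newline_count_gzip(fname):
    # First pass: decompress the whole file just to count its newlines.
    with gzip.open(fname, 'rb') as f:
        return sum(buf.count(b'\n') for buf in _reader_generator(f.read))

num = raw_newline_count_gzip('filename.gz')

# Second pass: process the lines, with the line count as the progress total.
with gzip.open('filename.gz', 'rt') as file:
    with tqdm(total=num) as pbar:
        for line in file:
            bytes_uncompressed = len(line.encode('utf-8'))
            # do whatever you want

            pbar.update(1)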

I hope this works for your files.

Answer 4

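After trying to implement this myself, I found a simple solution (it is not stated explicitly in the documentation). You can access the underlying file object as gzippedfile.buffer.fileobj when the file is opened in text mode, and as gzippedfile.fileobj when it is opened in binary mode.

If you iterate over the file, the cursor position returned by tell() on that underlying object is the number of bytes read from disk.

See the TextIOWrapper documentation for buffer and the gzip documentation for fileobj.

In your case, you can do: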

with gzip.open('filename.gz', 'rt') as file:
    for line in file:
        # tqdm uses incremental updates, so the increment is (current - last value)
        pbar.update(file.buffer.fileobj.tell() - pbar.n)
        # Do things

Here is a sample implementation of what @Mark Adler suggested, for when you do need access to the binary file:

with open('filename.gz', 'rb') as f, gzip.open(f, 'rt') as file:
    for line in file:
        pbar.n = f.tell()  # Another way to set progress when we know total progress rather than increment
        pbar.update(0)   # Call refresh if needed
        # Do things
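
The same loop can be packaged so that the progress reporting is pluggable. This is only a sketch: process_gzip_lines, handle_line, and on_progress are hypothetical names, and UTF-8 text is assumed.

```python
import gzip


def process_gzip_lines(path, handle_line, on_progress):
    """Feed each text line to handle_line and report compressed bytes consumed.

    on_progress(delta) receives the increase in the compressed position,
    which is the incremental form that tqdm's update() expects.
    Returns the final compressed position. (Hypothetical helper.)
    """
    last = 0
    with open(path, 'rb') as f, gzip.open(f, 'rt', encoding='utf-8') as lines:
        for line in lines:
            handle_line(line)
            pos = f.tell()          # compressed bytes read from disk so far
            if pos != last:         # the position only moves once per chunk
                on_progress(pos - last)
                last = pos
    return last
```

With the progress bar from the question, whose total is os.path.getsize('filename.gz'), this could be called as process_gzip_lines('filename.gz', handle, pbar.update), where handle is whatever per-line function you need.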
