Uncompressing .gz files and storing them in a .tar.gz archive

Question

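I have the following problem: I am writing a function that looks for a bunch of .gz files, uncompresses them, and stores the individually uncompressed files in a bigger .tar.gz archive. So far I have managed to implement it with the code below, but manually computing the uncompressed file size and setting the TarInfo size seems rather hackish, and I would like to know whether there is a more idiomatic solution to my problem: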

import gzip
import os
import pathlib
import tarfile

def gather_compressed_files(input_dir: pathlib.Path, output_file: str):
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in input_dir.glob('*.gz'):
            with gzip.open(input_file) as fd:
                tar_info = tarfile.TarInfo(input_file.stem)
                tar_info.size = fd.seek(0, os.SEEK_END)
                fd.seek(0, os.SEEK_SET)
                tar.addfile(tar_info, fd)

I tried creating the TarInfo object in the following way instead of building it manually:

tar_info = tar.gettarinfo(arcname=input_file.stem, fileobj=fd)

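However, this function inspects the original .gz file that we opened as fd to compute its size, so the resulting tar_info.size corresponds to the compressed .gz data rather than the uncompressed data, which is not what I want. Not setting tar_info.size at all does not work either, because addfile uses that size when it is passed a file object.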
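The size mismatch described above can be reproduced with a small, self-contained sketch. The file name `sample.txt.gz`, the payload, and the in-memory tar archive are illustrative assumptions, not part of the original code:

```python
import gzip
import io
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file, created only for this demonstration.
    gz_path = os.path.join(tmp, 'sample.txt.gz')
    payload = b'a' * 100_000  # highly compressible, so the two sizes differ a lot
    with gzip.open(gz_path, 'wb') as f:
        f.write(payload)

    # An in-memory tar archive is enough here; nothing is actually added to it.
    with tarfile.open(fileobj=io.BytesIO(), mode='w') as tar:
        with gzip.open(gz_path) as fd:
            tar_info = tar.gettarinfo(arcname='sample.txt', fileobj=fd)

    reported_size = tar_info.size
    compressed_size = os.path.getsize(gz_path)
    uncompressed_size = len(payload)

# gettarinfo stats the underlying file, so it sees the compressed size.
print(reported_size == compressed_size)
print(reported_size == uncompressed_size)
```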

Is there a better, more idiomatic way to achieve this, or am I stuck with my current solution?

Answer 1

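Your approach is the only way to avoid decompressing the file completely to disk or RAM. After all, you need to know the size ahead of time to add an entry to the tar file, and gzip files do not reliably know their own decompressed size. The ISIZE field (stored in the gzip trailer) theoretically provides the decompressed size, but it was defined back in the 32-bit days, so it is actually the size modulo 2**32; a file that was originally exactly 4 GiB (2**32 bytes) and a file of 0 B would have the same ISIZE. In any case, Python does not expose ISIZE, so even if it were useful, there is no built-in way to read it (you could always parse it by hand, but that is not exactly clean or idiomatic).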
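For illustration only, parsing ISIZE by hand might look like the sketch below. The helper name `read_isize` and the demo file are hypothetical. It assumes a single-member gzip file whose last four bytes are the little-endian ISIZE value, and the result is only the size modulo 2**32, so it cannot be trusted for large or concatenated inputs:

```python
import gzip
import os
import struct
import tempfile

def read_isize(path):
    """Return the ISIZE trailer field of a gzip file (uncompressed size mod 2**32).

    Hypothetical helper: assumes a single-member gzip file.
    """
    with open(path, 'rb') as f:
        f.seek(-4, os.SEEK_END)                    # ISIZE is the last 4 bytes
        (isize,) = struct.unpack('<I', f.read(4))  # little-endian unsigned 32-bit
    return isize

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.gz')
    demo_payload = b'hello world\n' * 1000
    with gzip.open(demo_path, 'wb') as f:
        f.write(demo_payload)
    demo_isize = read_isize(demo_path)

print(demo_isize == len(demo_payload))
```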

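If you want to avoid decompressing each file twice (once to seek forward to the end, once to actually add it to the tar file), at the cost of decompressing to disk, you can use tempfile.TemporaryFile to avoid the double decompression (without having to hold the whole file in memory) with a slight tweak: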

import gzip
import pathlib
import shutil
import tarfile
import tempfile

def gather_compressed_files(input_dir: pathlib.Path, output_file: str):
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in input_dir.glob('*.gz'):
            with tempfile.TemporaryFile() as tf:
                # Could combine both in one with, but this way we close the gzip
                # file ASAP
                with gzip.open(input_file) as fd:
                    shutil.copyfileobj(fd, tf)
                tar_info = tarfile.TarInfo(input_file.stem)
                tar_info.size = tf.tell()
                tf.seek(0)
                tar.addfile(tar_info, tf)
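If the inputs are known to be small, the RAM alternative mentioned above could be sketched as follows. The function name `gather_compressed_files_in_memory` is hypothetical, and the sketch assumes each decompressed file fits comfortably in memory and that the output archive is not written into `input_dir` (otherwise `*.gz` would also match the archive itself):

```python
import gzip
import io
import pathlib
import tarfile

def gather_compressed_files_in_memory(input_dir: pathlib.Path, output_file: str):
    with tarfile.open(output_file, 'w:gz') as tar:
        for input_file in sorted(input_dir.glob('*.gz')):
            # Decompress the whole file into memory in one pass.
            data = gzip.decompress(input_file.read_bytes())
            tar_info = tarfile.TarInfo(input_file.stem)
            # The size is simply the length of the decompressed bytes.
            tar_info.size = len(data)
            tar.addfile(tar_info, io.BytesIO(data))
```

This decompresses each file only once, but it holds both the compressed and the decompressed bytes in memory at the same time, so the temporary-file version remains the safer choice for large inputs.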
