Read .gz files inside .tar files without extracting

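I have a .tar file that contains many .gz files in a folder. Each of these .gz files contains a single .txt file. The other Stack Overflow questions on this topic focus on extracting the files.

I am trying to iterate over the contents of each .txt file without extracting them, because the .tar is large.

First I read the contents of the .tar file: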

import tarfile
tar = tarfile.open("FILE.tar")
tar.getmembers()

Or in Unix:

tar xvf file.tar -O

Then I tried the tarfile extractfile method, but I get an error: "module 'tarfile' has no attribute 'extractfile'". I'm also not sure this is even the right approach.

import gzip
for member in tar.getmembers():
    m = tarfile.extractfile(member)
    file_contents = gzip.GzipFile(fileobj=m).read()

To create sample files that mimic the original ones:

$ mkdir directory
$ touch directory/file1.txt.gz directory/file2.txt.gz directory/file3.txt.gz
$ tar -c -f file.tar directory
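
The touch commands above create empty files, so there is no text to read back from them. As a rough alternative, the sketch below builds a comparable archive from Python with real gzip data in each member. The helper name build_sample_tar and its arguments are hypothetical, chosen only for illustration.

```python
import gzip
import io
import tarfile


def build_sample_tar(tar_path, texts):
    """Write a tar archive whose members are gzip-compressed text files.

    `texts` maps a member name (e.g. "directory/file1.txt.gz") to the
    plain text that should be compressed into that member.
    """
    with tarfile.open(tar_path, "w") as tar:
        for name, text in texts.items():
            payload = gzip.compress(text.encode("utf-8"))
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)  # addfile() copies exactly info.size bytes
            tar.addfile(info, io.BytesIO(payload))
```

Calling build_sample_tar("file.tar", {"directory/file1.txt.gz": "first file\n"}) should give an archive with the same layout as the shell commands above, but with readable content.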

This is the final version that worked for me after applying Mark Adler's suggestion:

import gzip
import tarfile
tar = tarfile.open("file.tar")
members = tar.getmembers()

# Here I append the results in a list, because I wasn't able to
# parse the tarfile type returned by .getmembers():
tar_name = []
for elem in members:
    tar_name.append(elem.name)

# Then I changed tarfile.extractfile to tar.extractfile as suggested:
for member in tar_name:
    # I'm using this because I have other non-gzs in the directory
    if member.endswith(".gz"):
        m = tar.extractfile(member)
        file_contents = gzip.GzipFile(fileobj=m).read()
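
If the individual text files are large as well, it may help to stream them line by line rather than calling read() on each one. A minimal sketch, assuming the text is UTF-8; the function name iter_gz_text_lines is made up for this example:

```python
import gzip
import tarfile


def iter_gz_text_lines(tar_path, encoding="utf-8"):
    """Yield (member_name, line) for every line of every .gz member."""
    with tarfile.open(tar_path) as tar:
        for member in tar:  # iterating a TarFile yields TarInfo objects
            # Skip directories and anything that is not a .gz file
            if not (member.isfile() and member.name.endswith(".gz")):
                continue
            raw = tar.extractfile(member)  # file-like object, nothing is written to disk
            # gzip.open accepts an open file object and decodes to text in "rt" mode
            with gzip.open(raw, mode="rt", encoding=encoding) as text:
                for line in text:
                    yield member.name, line.rstrip("\n")
```

Because this is a generator, only one line is held in memory at a time, and the archive stays open only while the caller is iterating.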

Here is how to do it from the Unix/bash command line.

Prepare the files:

$ git clone https://github.com/githubtraining/hellogitworld.git
$ cd hellogitworld
$ gzip *
$ ls
build.gradle.gz  fix.txt.gz  pom.xml.gz  README.txt.gz  resources  runme.sh.gz  src
$ cd ..
$ tar -cf hellogitworld.tar hellogitworld/

Here is how to view the README:

$ tar -Oxf hellogitworld.tar hellogitworld/README.txt.gz | zcat

Result:

This is a sample project students can use during Matthew's Git class.

Here is an addition by me

We can have a bit of fun with this repo, knowing that we can always reset it to a known good state.  We can apply labels, and branch, then add new code and merge it in to the master branch.

As a quick reminder, this came from one of three locations in either SSH, Git, or HTTPS format:

* git@github.com:matthewmccullough/hellogitworld.git
* git://github.com/matthewmccullough/hellogitworld.git
* https://matthewmccullough@github.com/matthewmccullough/hellogitworld.git

We can, as an example effort, even modify this README and change it as if it were source code for the purposes of the class.

This demo also includes an image with changes on a branch for examination of image diff on GitHub.

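Note that I am not affiliated with those git repositories.

Explanation of the tar flags:

  • flag -x = extract
  • flag -O = do not write the files to the file system, write them to STDOUT instead
  • flag -f = specify the archive file

All that remains is to pipe the result to zcat to see the uncompressed plain text on STDOUT.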
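
For comparison, roughly the same pipeline can be written in Python. This is only a sketch: read_gz_member is a hypothetical helper name, and it loads the whole decompressed member into memory.

```python
import gzip
import tarfile


def read_gz_member(tar_path, member_name):
    """Return the decompressed bytes of one gzip member of a tar archive."""
    with tarfile.open(tar_path) as tar:
        # extractfile() raises KeyError if the archive has no such member
        raw = tar.extractfile(member_name)
        if raw is None:  # returned for members that are not regular files
            raise ValueError(f"{member_name!r} is not a regular file")
        return gzip.decompress(raw.read())
```

With the archive prepared above, read_gz_member("hellogitworld.tar", "hellogitworld/README.txt.gz").decode() would play the role of the tar and zcat pipeline.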

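You need to use tar.extractfile(member) instead of tarfile.extractfile(member). tarfile is the module, which knows nothing about the tar file you opened. tar is the TarFile object, which holds the reference to the .tar file you opened.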
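
The difference is easy to check from the interpreter. A small sketch (the variable names are just for illustration):

```python
import tarfile

# The module itself has no extractfile function, which is why
# tarfile.extractfile(member) fails with an AttributeError.
module_has_it = hasattr(tarfile, "extractfile")

# The method lives on the TarFile class, so it must be called on the
# object returned by tarfile.open().
class_has_it = hasattr(tarfile.TarFile, "extractfile")

print(module_has_it, class_has_it)
```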

To do this properly, use next() instead of getmembers() or getnames(), so that you don't have to read through the entire tar file twice:

import gzip
import sys
import tarfile

with tarfile.open(sys.argv[1]) as tar:
    while ent := tar.next():  # next() returns None after the last member (Python 3.8+)
        if ent.name.endswith(".gz"):
            print(gzip.GzipFile(fileobj=tar.extractfile(ent)).read())
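
One practical consequence of walking the archive with next() is that the loop can stop as soon as it has what it needs, without scanning the remaining headers. A minimal sketch, assuming Python 3.8+ for the := operator; first_gz_member is a hypothetical name:

```python
import gzip
import tarfile


def first_gz_member(tar_path):
    """Return (name, decompressed_bytes) for the first .gz member, or None."""
    with tarfile.open(tar_path) as tar:
        # next() reads one header at a time and returns None at the end
        while (ent := tar.next()) is not None:
            if ent.isfile() and ent.name.endswith(".gz"):
                # Returning here leaves the rest of the archive unread
                return ent.name, gzip.decompress(tar.extractfile(ent).read())
    return None
```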

import gzip
import tarfile

# Alternative: look members up by name and print the first line of each .gz file
with tarfile.open("data.tar", "r") as tar_fd:
    for name in tar_fd.getnames():
        if name.endswith(".gz"):
            member_file = tar_fd.extractfile(name)
            first_line = gzip.GzipFile(fileobj=member_file).readline()
            print(first_line)