Read contents of .gz file with Python

Question

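I'm new to Python and have run into a problem reading the contents of .gz files:

I have a folder full of .gz files that I pulled programmatically using a private API. Each .gz file contains a single .xml file, so I need to iterate through the directory and extract them.

The problem is that when I programmatically extract these .gz files to their respective .xml versions, the files are created without error, and when I open one (using TextWrangler) it looks like a normal .xml file, but it does not look like one when I view it in a hex editor. Also, when I programmatically open the .xml file and print its contents, it shows up as a bunch of (binary?) garbled text.

By contrast, if I extract one of the files manually (i.e. using OS X rather than Python), the file looks as expected in a hex editor.

Here is my code snippet (the imports aren't shown, but they are glob and gzip):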

searchpattern = siteid + "_" + resource + "_*.gz"
for infile in glob.glob(workingDir + searchpattern):
    print infile

    #read the zipped contents  (https://docs.python.org/2/library/gzip.html)
    f = gzip.open(infile, 'rb')
    file_content = f.read()
    file_content = str(file_content) #This was an attempt to fix
    print file_content #  This shows a bunch of mumbo jumbo

    #write the contents we just read to a new file (uncompressed)
    newfilename = infile[0:-3] # the filename without the ".gz"
    newfilename = newfilename + ".xml"
    fnew = open(newfilename, 'w+b')
    fnew.write(str(file_content))
    fnew.close()

    #delete the .gz version of the file
    #os.remove(infile)

Answer 1

When I run this against an XML file, I don't run into any problems.

If I gzip an XML file, extract it with this program, and compare the original file with the program's output, I find no differences.
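
That round-trip check can be sketched roughly as follows. This is a minimal illustration written for Python 3 (the snippet in the question is Python 2); the helper names `gunzip_to_bytes` and `round_trip_matches` and the sample XML are made up for the example.

```python
import gzip
import os
import tempfile


def gunzip_to_bytes(path):
    """Return the decompressed contents of a .gz file as bytes."""
    with gzip.open(path, "rb") as f:
        return f.read()


def round_trip_matches(original_bytes):
    """Gzip the bytes to a temporary file, read them back, and compare."""
    with tempfile.TemporaryDirectory() as tmp:
        gz_path = os.path.join(tmp, "sample.xml.gz")  # hypothetical file name
        with gzip.open(gz_path, "wb") as f:
            f.write(original_bytes)
        return gunzip_to_bytes(gz_path) == original_bytes


# Made-up sample document, standing in for one of the real XML files.
sample_xml = b'<?xml version="1.0"?><root><item>1</item></root>'
print(round_trip_matches(sample_xml))
```

If this prints `True` for a known-good XML file, the extraction logic itself is probably not what is corrupting the output.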

This program does add an extra ".xml" extension, though.
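
The question doesn't say how the downloaded files are named. Assuming some of them already end in `.xml.gz`, one possible way to avoid a doubled extension is sketched below; `output_name` is a hypothetical helper, not something from the original code.

```python
def output_name(gz_path):
    """Derive the output file name for a .gz file without doubling '.xml'."""
    # Drop the trailing ".gz" if it is present.
    base = gz_path[:-3] if gz_path.endswith(".gz") else gz_path
    # Only append ".xml" when the remaining name doesn't already end with it.
    if base.lower().endswith(".xml"):
        return base
    return base + ".xml"


print(output_name("site_resource_1.xml.gz"))  # site_resource_1.xml
print(output_name("site_resource_1.gz"))      # site_resource_1.xml
```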

Answer 2

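So this turned out to be a silly mistake on my part, but I'll post this as a follow-up for anyone else who makes the same mistake.

The problem was that my program was compressing content that was already compressed. With that in mind, there is nothing wrong with the code snippet in this thread. There is (technically) nothing wrong with my code that creates the .gz files either. As you can see below, opening the file normally, instead of using the gzip library earlier in the program, does the trick.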

    #Download and write the contents of each response to a .gz file
    if limitCounter < limit or int(limit) == 0:
        print _name + "  " + scopeStartDate + " through " + scopeEndDate + " at " + href
        file = api.get(href)
        gz_file_content = file.content
        #gz_file = gzip.open(workingDir + _name, "wb") # This breaks the program later
        gz_file = open(workingDir + _name, 'wb') # This works.
        gz_file.write(gz_file_content)
        gz_file.close()
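
To illustrate what went wrong, here is a small sketch of the double-compression effect, plus one way to recover files that were already written that way. It assumes Python 3 and that the API response body is itself gzip data; `looks_gzipped` and `fully_decompress` are hypothetical helpers, not part of the gzip module.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def looks_gzipped(data):
    """Return True if the bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def fully_decompress(data, max_layers=5):
    """Strip gzip layers until the payload no longer looks gzipped.

    Returns the final bytes and the number of layers removed.
    """
    layers = 0
    while looks_gzipped(data) and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return data, layers


xml = b"<root/>"
downloaded = gzip.compress(xml)         # stands in for the already-gzipped API response
double = gzip.compress(downloaded)      # what writing it through gzip.open(..., "wb") stores

once = gzip.decompress(double)          # a single gzip.open(...).read() pass
print(looks_gzipped(once))              # True: still gzip data, hence the "mumbo jumbo"
print(fully_decompress(double))         # (b'<root/>', 2)
```

The magic-number test is only a heuristic: it would misfire on a payload that legitimately begins with those two bytes, which XML text does not.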
