Extracting bz2 file with single file in memory

Question

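I have a csv file compressed into a bz2 file, and I am trying to load it from a website, decompress it, and write it to a local csv file with the following code: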

# Get zip file from website
archive = StringIO()
url_data = urllib2.urlopen(url)
archive.write(url_data.read())

# Extract the training data
data = bz2.decompress(archive.read())

# Write to csv
output_file = open('dataset_' + mode + '.csv', 'w')
output_file.write(data)

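On the decompress call, I get IOError: invalid data stream. Note that the csv file contained in the archive has quite a few characters that might cause problems. In particular, if I try to put the file contents into unicode, I get an error about not being able to decode 0xfd. I only have the single file in the archive, but I'm wondering whether something is going wrong because I'm not extracting a specific file.

Any ideas?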

Answer 1

I suspect you are getting this error because the stream you are feeding to the decompress() function is not a valid bz2 stream.
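One quick way to test that suspicion is to look at the first bytes of the download, since a bzip2 stream begins with the magic bytes "BZh". The following is a minimal Python 3 sketch, not part of the original code: looks_like_bz2 is a hypothetical helper name, and both sample payloads are made up to stand in for a real download.

```python
import bz2


def looks_like_bz2(payload):
    """Return True if payload starts with the bzip2 magic bytes b'BZh'."""
    return payload[:3] == b"BZh"


# Made-up sample data standing in for what the server might return.
good = bz2.compress(b"id,value\n1,foo\n")
bad = b"<html><body>404 Not Found</body></html>"

print(looks_like_bz2(good))
print(looks_like_bz2(bad))
```

A check like this only rules out obviously wrong responses, such as an HTML error page. It does not prove that the rest of the stream is intact.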

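You must also "rewind" your StringIO buffer after writing to it. See the comments in the code below. The following code (the same as yours, except for the imports and the seek() fix) works if the URL points to a valid bz2 file.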

from StringIO import StringIO
import urllib2
import bz2

# Get zip file from website
url = "http://www.7-zip.org/a/7z920.tar.bz2"  # just an example bz2 file

archive = StringIO()

# in case the request fails (e.g. 404, 500), this will raise
# a `urllib2.HTTPError`
url_data = urllib2.urlopen(url)

archive.write(url_data.read())

# will print how much compressed data you have buffered.
print "Length of file:", archive.tell()

# important!... make sure to reset the file descriptor read position
# to the start of the file.
archive.seek(0)

# Extract the training data
data = bz2.decompress(archive.read())

# Write to csv
output_file = open('output_file', 'w')
output_file.write(data)
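The rewind step can also be seen in isolation, without any network access. This is a rough Python 3 sketch that uses io.BytesIO in place of StringIO. The CSV content is made up, and bz2.compress() merely stands in for the bytes that url_data.read() would return.

```python
import bz2
import io

original = b"id,value\n1,foo\n2,bar\n"  # made-up CSV content
compressed = bz2.compress(original)     # stands in for url_data.read()

archive = io.BytesIO()
archive.write(compressed)

# After write() the position sits at the end of the buffer,
# so read() has nothing left to return.
unrewound = archive.read()

# Rewind to the start, then read the whole buffer.
archive.seek(0)
rewound = archive.read()

data = bz2.decompress(rewound)

# getvalue() returns the full contents regardless of the current position.
same = archive.getvalue()

print(len(unrewound), len(rewound), data == original)
```

Because the whole download is already held in memory, the intermediate buffer is optional: passing the downloaded bytes straight to bz2.decompress() avoids the position issue altogether.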

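Re: encoding issues

Normally, a character encoding error generates a UnicodeError (or one of its cousins), but not an IOError. An IOError suggests that something is wrong with the input, such as truncation, or some other error that prevents the decompressor from completing its work.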
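The difference between the two failure modes can be shown side by side. This is an illustrative Python 3 sketch with made-up input bytes. Note that in Python 3, IOError is an alias of OSError, so the exception reported here is named OSError rather than IOError as in the Python 2 code above.

```python
import bz2

decompress_error = None
decode_error = None

# Bytes that are not a bzip2 stream: the failure comes from the
# decompressor and is reported as OSError (IOError in Python 2).
try:
    bz2.decompress(b"this is not a bz2 stream")
except OSError as exc:
    decompress_error = type(exc).__name__

# A byte that is not valid UTF-8: the failure comes from the codec
# and is reported as UnicodeDecodeError, a subclass of UnicodeError.
try:
    b"\xfd".decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = type(exc).__name__

print(decompress_error, decode_error)
```

In other words, the 0xfd decoding message and the invalid data stream message come from different layers, and only the second one involves the bz2 module.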

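You omitted the imports from your question, and one of the subtle differences between StringIO and cStringIO (according to the docs) is that cStringIO cannot handle unicode strings that cannot be converted to ASCII. This no longer seems to hold (at least in my tests), but it may be at play here.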

Unlike the StringIO module, this module (cStringIO) is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.
