Python3 working with csv files in tar files
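I am trying to work with csv files contained in a tar.gz file, but I am having trouble passing the correct data/object to the csv module.
Say I have a tar.gz file containing a number of csv files formatted as follows.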
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38
I want to be able to access each csv file in memory, without extracting each file from the tar file and writing it to disk.
For example:
import tarfile
import csv

tar = tarfile.open("tar-file.tar.gz")
for member in tar.getmembers():
    f = tar.extractfile(member).read()
    content = csv.reader(f)
    for row in content:
        print(row)
tar.close()
This produces the following error.
for row in content:
_csv.Error: iterator should return strings, not int (did you open the file in text mode?)
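I also tried passing f as a string wrapped in a list, as described in the csv module documentation.
content = csv.reader([f])
The above produces the same error.
I have also tried decoding the file contents f as ascii.
f = tar.extractfile(member).read().decode('ascii')
But this iterates over every single character of the csv data, instead of iterating over rows that each contain a list of elements.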
['1']
['0']
['7']
['9']
['', '']
['S']
['A']
['M']
['P']
['L']
['E']
['_']
['A']
['', '']
['G']
['R']
snip...
['2']
['0']
['1']
['7']
['/']
['0']
['2']
['/']
['1']
['5']
[' ']
['2']
['2']
[':']
['5']
['7']
[':']
['3']
['8']
[]
[]
Decoding f as ascii and passing it as a single string in a list
f = tar.extractfile(member).read().decode('ascii')
content = csv.reader([f])
produces the following output
for row in content:
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
To demonstrate the different outputs, I used the following code.
import tarfile
import csv

tar = tarfile.open("tar-file.tar.gz")
for member in tar.getmembers():
    f = tar.extractfile(member).read()
    print(member.name)
    print('Raw :', type(f))
    print(f)
    print()
    f = f.decode('ascii')
    print('ASCII:', type(f))
    print(f)
tar.close()
This produces the following output. (For this example, every csv contains the same data.)
./raw_data/csv-file1.csv
Raw : <class 'bytes'>
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n'
ASCII: <class 'str'>
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38
./raw_data/csv-file2.csv
Raw : <class 'bytes'>
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n'
ASCII: <class 'str'>
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38
./raw_data/csv-file3.csv
Raw : <class 'bytes'>
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n'
ASCII: <class 'str'>
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38
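How can I get the csv module to correctly read an in-memory file provided by the tar module?
Thanks.
You just need to use io.StringIO() to create a file-like object for the csv library to use. For example: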
import tarfile
import csv
import io

with tarfile.open('input.rar') as tar:
    for member in tar:
        if member.isreg():      # Is it a regular file?
            print("{} - {} bytes".format(member.name, member.size))
            csv_file = io.StringIO(tar.extractfile(member).read().decode('ascii'))
            for row in csv.reader(csv_file):
                print(row)
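To see why the attempts in the question failed and why io.StringIO() helps, here is a minimal sketch. The variable names and the two-row sample payload are made up for illustration; the point is only that csv.reader iterates whatever it is given, so the result depends on what iterating that object yields.

```python
import csv
import io

# Two sample rows in the same shape as the question's data (illustrative only).
raw = b"1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n"

# Iterating bytes yields ints, which is why csv.reader complains about "not int".
first_item_of_bytes = next(iter(raw))

# Iterating a str yields one character at a time, so every "row" is one character.
text = raw.decode("ascii")
rows_from_str = list(csv.reader(text))

# io.StringIO iterates line by line, which is what csv.reader expects.
rows_from_stringio = list(csv.reader(io.StringIO(text)))

print(rows_from_str[:5])
print(rows_from_stringio)
```

A list of lines such as text.splitlines() would also satisfy csv.reader here; StringIO simply keeps the file-like interface.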
Revisiting this question nearly 3 years later: note that a short discussion elsewhere produced a better solution:
import tarfile
import csv
import io

with tarfile.open('input.rar') as tar:
    for member in tar:
        if member.isreg():      # Is it a regular file?
            print("{} - {} bytes".format(member.name, member.size))
            csv_file = io.TextIOWrapper(tar.extractfile(member), encoding="utf-8")
            for row in csv.reader(csv_file):
                print(row)
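TextIOWrapper performs better on larger files, because it does not need to consume the whole file at once. In contrast, when tar.extractfile(member).read() is executed, the complete member file is loaded into memory.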
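For anyone who wants to try the TextIOWrapper approach without a tar file on disk, the following is a self-contained sketch. The helper names build_sample_archive and read_csv_members, the member name, and the sample rows are assumptions made for this example; the archive is built in memory so the snippet can run as-is. Passing newline="" follows the csv module's general advice for text files and is an addition to the code above.

```python
import csv
import io
import tarfile


def build_sample_archive():
    """Return a BytesIO holding a tiny tar.gz with one csv member (made-up data)."""
    payload = (
        b"1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n"
        b"1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n"
    )
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="raw_data/csv-file1.csv")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def read_csv_members(archive):
    """Map each regular member's name to its list of csv rows."""
    result = {}
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:
            if not member.isreg():
                continue
            binary_file = tar.extractfile(member)
            # Decode lazily; newline="" leaves line-ending handling to csv.
            with io.TextIOWrapper(binary_file, encoding="utf-8", newline="") as text_file:
                result[member.name] = list(csv.reader(text_file))
    return result


rows_by_name = read_csv_members(build_sample_archive())
for name, rows in rows_by_name.items():
    print(name, rows)
```

Collecting the rows with list() is only for the demonstration; for large members, iterating over csv.reader(text_file) row by row keeps the memory benefit described above.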