Creating a single line from multiple lines of a huge text file, based on an ID, in Python or otherwise
I have a very large text file (20 GB) with lines like this:
1 Some text
1 More text
2 Text
2 Follow up text
..
..
n
I want to transform the file into this:
1, sometext, more text
2, text , followup text
How can I do this in Python? I cannot hold the whole file in memory.
You can use itertools.groupby. Do something along these lines:
from itertools import groupby
# from itertools import groupby, imap  # Python 2: use imap, since map returns a list

def tokens(line):
    # Split into (id, rest-of-line) on the first space only.
    return [t.strip() for t in line.strip().split(' ', 1)]

with open('infile.txt', 'r') as fin, open('outfile.txt', 'w') as fout:
    for k, g in groupby(map(tokens, fin), key=lambda t: t[0]):
        # for k, g in groupby(imap(tokens, fin), key=lambda t: t[0]):  # Python 2
        fout.write(', '.join([k] + [x[1] for x in g]) + '\n')
        # not to be too silent
        print('Processing id: ' + k)
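Note that groupby only merges *consecutive* lines that share a key, which is exactly why this streams the file in constant memory; it assumes the input ids are already grouped (as in your sample). A minimal self-contained sketch of the same approach, using an in-memory StringIO standing in for the 20 GB file:

```python
from io import StringIO
from itertools import groupby

# Stand-in for the real file; the ids are assumed to be pre-grouped.
sample = StringIO(
    "1 Some text\n"
    "1 More text\n"
    "2 Text\n"
    "2 Follow up text\n"
)

def tokens(line):
    # Split into (id, rest-of-line) on the first space only.
    return [t.strip() for t in line.strip().split(' ', 1)]

merged = []
# groupby merges consecutive lines with equal ids, one group at a time,
# so only the current group is ever held in memory.
for k, g in groupby(map(tokens, sample), key=lambda t: t[0]):
    merged.append(', '.join([k] + [x[1] for x in g]))

print(merged)  # → ['1, Some text, More text', '2, Text, Follow up text']
```

If the ids in the real file were *not* already grouped, you would need an external sort (e.g. the Unix sort utility) first, since groupby would otherwise emit the same id several times.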