Creating a single line from multiple lines of a huge text file, based on an ID, in Python or otherwise
I have a very large text file (20 GB) with lines like this:
1 Some text
1 More text
2 Text
2 Follow up text
..
..
n
I want to transform the file into this:
1, sometext, more text
2, text , followup text
How can I do this in Python? I cannot hold the whole file in memory.
You can use itertools.groupby. Do something along these lines:
from itertools import groupby
# from itertools import groupby, imap  # Python 2: use imap, since map returns a list

def tokens(line):
    # Split into (id, rest-of-line) on the first space only.
    return [t.strip() for t in line.strip().split(' ', 1)]

with open('infile.txt', 'r') as fin, open('outfile.txt', 'w') as fout:
    for k, g in groupby(map(tokens, fin), key=lambda t: t[0]):
        # for k, g in groupby(imap(tokens, fin), key=lambda t: t[0]):  # Python 2
        fout.write(', '.join([k] + [x[1] for x in g]) + '\n')
        # not to be too silent
        print('Processing id: ' + k)
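Note that groupby only merges *consecutive* lines that share a key, which is exactly why this streams the file in constant memory; it assumes the input ids are already grouped (as in your sample). A minimal self-contained sketch of the same approach, using an in-memory StringIO standing in for the 20 GB file:

```python
from io import StringIO
from itertools import groupby

# Stand-in for the real file; the ids are assumed to be pre-grouped.
sample = StringIO(
    "1 Some text\n"
    "1 More text\n"
    "2 Text\n"
    "2 Follow up text\n"
)

def tokens(line):
    # Split into (id, rest-of-line) on the first space only.
    return [t.strip() for t in line.strip().split(' ', 1)]

merged = []
# groupby merges consecutive lines with equal ids, one group at a time,
# so only the current group is ever held in memory.
for k, g in groupby(map(tokens, sample), key=lambda t: t[0]):
    merged.append(', '.join([k] + [x[1] for x in g]))

print(merged)  # → ['1, Some text, More text', '2, Text, Follow up text']
```

If the ids in the real file were *not* already grouped, you would need an external sort (e.g. the Unix sort utility) first, since groupby would otherwise emit the same id several times.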