Handling a huge bz2 file

Question

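I need to process a huge bz2 file (5+ GB) with Python. With my current code, I always run into a memory error. I read somewhere that I could use sqlite3 to handle this. Is that right? If so, how should I adapt my code? (I am not very experienced with sqlite3...)

Here is the beginning of my current code: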

import csv, bz2

names = ('ID', 'FORM')

filename = "huge-file.bz2"

# bz2.open in text mode ('rt') decompresses lazily and yields str lines, which csv needs in Python 3
with bz2.open(filename, 'rt', newline='') as f:
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    tokens = [sentence for sentence in reader]

After this, I need to go through 'tokens'. It would be great if I could handle this huge bz2 file, so any help is very welcome! Thank you very much for any advice!

Answer 1

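The file is too large to be read all at once, because your process will run out of memory.

The solution is to read the file in chunks/lines and process each one before reading the next.

The list comprehension line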

tokens = [sentence for sentence in reader]

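reads the whole file into tokens, which is what can make the process run out of memory.

csv.DictReader can read the CSV records line by line, meaning that on each iteration only one row of data is loaded into memory.

Like this: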

# bz2.open in text mode ('rt') decompresses lazily and yields str lines, which csv needs in Python 3
with bz2.open(filename, 'rt', newline='') as f:
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    for sentence in reader:
        # do something with sentence (process/aggregate/store/etc.)
        pass

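Note that if, inside the added loop, the data from sentence is stored in another variable (such as tokens), it may still consume a lot of memory, depending on how large the data is. So it is better to aggregate it, or to use another kind of storage for that data.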
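Since the question mentions sqlite3, here is a minimal sketch of what "another kind of storage" could look like: the rows are streamed from the compressed file and written to an SQLite table in batches, so only one batch is in memory at a time. The helper name load_into_sqlite, the table name tokens, the batch size and the UTF-8 encoding are all assumptions made for illustration, not something from the original post.

```python
import bz2
import csv
import sqlite3
from itertools import islice

NAMES = ('ID', 'FORM')

def load_into_sqlite(bz2_path, db_path, batch_size=10000):
    """Stream a tab-separated bz2 file into an SQLite table, one batch at a time."""
    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical schema: one TEXT column per field in NAMES.
        conn.execute('CREATE TABLE IF NOT EXISTS tokens (id TEXT, form TEXT)')
        # Text mode decompresses lazily; UTF-8 is an assumption about the data.
        with bz2.open(bz2_path, 'rt', encoding='utf-8', newline='') as f:
            reader = csv.DictReader(f, fieldnames=NAMES, delimiter='\t')
            rows = ((row['ID'], row['FORM']) for row in reader)
            while True:
                # Pull at most batch_size rows from the lazy generator.
                batch = list(islice(rows, batch_size))
                if not batch:
                    break
                conn.executemany('INSERT INTO tokens (id, form) VALUES (?, ?)', batch)
        conn.commit()
    finally:
        conn.close()
```

After loading, the data can be queried with ordinary SQL (for example SELECT form FROM tokens WHERE id = ?) instead of being held in a Python list.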

Update

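Regarding using some of the previous lines in your processing (as discussed in the comments): you can store the previous line in another variable, which is replaced on each iteration.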
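A small sketch of the single-variable approach, assuming the same ID/FORM columns as in the question. The in-memory sample data and the pairs list are hypothetical and only there to make the example runnable; on the real file you would process each pair instead of collecting it.

```python
import csv
import io

names = ('ID', 'FORM')

# A tiny in-memory sample stands in for the decompressed file (hypothetical data).
sample = io.StringIO('1\tthe\n2\tcat\n3\tsat\n')
reader = csv.DictReader(sample, fieldnames=names, delimiter='\t')

pairs = []       # demo only: collects (previous FORM, current FORM)
previous = None  # there is no previous line before the first iteration
for sentence in reader:
    if previous is not None:
        pairs.append((previous['FORM'], sentence['FORM']))
    previous = sentence  # replaced on every iteration, so only one extra row is kept

print(pairs)  # [('the', 'cat'), ('cat', 'sat')]
```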

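Or, if you need several lines back, you can keep a list of the last n lines.

How

Use collections.deque with maxlen to keep track of the last n lines. Import deque from the collections standard module at the top of your file.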

from collections import deque

# rest of the code ...

last_sentences = deque(maxlen=5) # keep the previous lines as we need for processing new lines
for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
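To make the maxlen behaviour concrete, here is a small runnable sketch. The generator name with_previous and the sample values are hypothetical; the point is only that the deque silently drops its oldest entry once it is full.

```python
from collections import deque

def with_previous(rows, n):
    """Yield (row, previous_rows), where previous_rows holds at most n earlier rows."""
    window = deque(maxlen=n)
    for row in rows:
        # Copy the window so later appends do not change what was already yielded.
        yield row, list(window)
        window.append(row)  # once full, the oldest entry is dropped automatically

for row, previous in with_previous(['s1', 's2', 's3', 's4'], 2):
    print(row, previous)
# s1 []
# s2 ['s1']
# s3 ['s1', 's2']
# s4 ['s2', 's3']
```

Wrapping the reader this way, for example with_previous(reader, 5), keeps the memory bound at n rows regardless of the file size.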

I recommend the solution above, but you can also implement it yourself with a list and keep track of its size manually.

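Define an empty list before the loop. At the end of each iteration, check whether the list has already reached the length you need; if so, drop the oldest items, then append the current line.

last_sentences = [] # keep the previous lines as we need for processing new lines
for sentence in reader:
    # process the sentence
    if len(last_sentences) >= 5: # make sure we won't keep all the previous sentences
        last_sentences = last_sentences[-4:] # keep the 4 newest so the list holds 5 after appending
    last_sentences.append(sentence)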
