Running out of RAM when writing to a file line by line [Python]

Question

I have a data processing task on some large data. I run a Python script on EC2 that looks something like the following:

with open(LARGE_FILE, 'r') as f:
    with open(OUTPUT_FILE, 'w') as out:
        for line in f:
            results = some_computation(line)
            out.write(json.dumps(results))
            out.write('\n')

I iterate over the data line by line and write the results to another file, also line by line.

After it had been running for a few hours, I could no longer log in to the server. I had to restart the instance to continue.

$ ssh ubuntu@$IP_ADDRESS
ssh_exchange_identification: read: Connection reset by peer

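The server is probably running out of RAM. While the script writes to the file, RAM usage slowly creeps up. I am not sure why memory would be a problem when reading and writing line by line.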
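One way to confirm that memory really grows with the number of lines processed is to sample the process's peak resident set size at intervals. The following is only a minimal sketch, assuming a Unix host such as the EC2 instance above; `peak_rss_mb`, `log_memory`, `log_every` and `samples` are hypothetical names introduced for illustration.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(lines, log_every=100_000, samples=None):
    """Pass lines through unchanged, sampling peak RSS every `log_every` lines.

    `samples` is an optional list that collects (line_count, rss_mib) tuples.
    """
    for count, line in enumerate(lines, start=1):
        if count % log_every == 0:
            rss = peak_rss_mb()
            print(f"{count} lines, peak RSS {rss:.1f} MiB", file=sys.stderr)
            if samples is not None:
                samples.append((count, rss))
        yield line
```

Wrapping the input as `for line in log_memory(f):` leaves the rest of the loop unchanged. Because this value is a high-water mark it never goes down, but a figure that keeps rising across samples suggests something inside the loop is holding on to memory.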

I have plenty of hard disk space.

I think the closest existing question is: Does the Python "open" function save its content in memory or in a temp file?

Answer 1

I was using SpaCy to do some preprocessing on the text. It looks like using the tokenizer causes steady memory growth.

https://github.com/spacy-io/spaCy/issues/285
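If the growth comes from state that accumulates inside the processing object, one possible workaround is to rebuild that object every so often. This is a sketch under that assumption, not a fix taken from the linked issue; `process_with_reload`, `make_processor` and `reload_every` are hypothetical names, and `make_processor` stands for whatever factory builds the per-line callable (for example, one that loads the SpaCy pipeline).

```python
def process_with_reload(lines, make_processor, reload_every=100_000):
    """Yield processor(line) for each line, rebuilding the processor periodically.

    `make_processor` is a hypothetical zero-argument factory that returns a
    fresh callable; any caches held by the previous one become unreferenced.
    """
    processor = make_processor()
    for count, line in enumerate(lines, start=1):
        yield processor(line)
        if count % reload_every == 0:
            # Replace the old object so its accumulated state can be collected.
            processor = make_processor()
```

Whether the memory is actually released depends on how the library manages its internal caches, so it is worth measuring before and after. Rebuilding also costs time, so `reload_every` is a trade-off between memory use and throughput.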


python

file-io

amazon-ec2

amazon-web-services

spacy