Reading a big language corpus without a MemoryError on a computer with 16 GB of RAM

Question

I found that Google NMT uses the codecs module to read its input data file.

import codecs
import tensorflow as tf
with codecs.getreader("utf-8")(tf.gfile.GFile(input_file, mode="rb")) as f:
    return f.read().splitlines()

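I have two questions.

1. Does the approach above, using tf.gfile.GFile, support reading a huge dataset of more than about 5 GB on a personal computer with 16 GB of RAM without a memory error? I would really appreciate a solution that lets me read a large language corpus without getting a MemoryError.

2. I have imported codecs in my code, so why do I get the error "NameError: name 'codecs' is not defined"?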

Edit 1:

For question 2, I now get:

 OutOfRangeError                           Traceback (most recent call last)
    <ipython-input-7-e78786c1f151> in <module>()
          6 input_file = os.path.join(source_path)
          7 with codecs.getreader("utf-8")(tf.gfile.GFile(input_file, mode="rb")) as f:
    ----> 8     source_text = f.read().splitlines()

OutOfRangeError is raised when an operation iterates past the valid input range. How can I fix this?

Answer 1

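If the file is large, I recommend processing it line by line instead of calling read() on the whole file. The code below should do the trick: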

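# Iterating over the file object reads one line at a time instead of loading the whole file.
with open(input_file, encoding="utf-8") as infile:
    for line in infile:
        do_something_with(line)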
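If you need the lines in groups (for example, to feed mini-batches to a model), the same line-by-line idea can be wrapped in generators. The following is only a minimal sketch using the standard library: the helper names `iter_lines`, `iter_batches` and `demo` are made up for illustration, and the small temporary file stands in for the real corpus. At any moment, memory holds roughly one batch of lines rather than the whole file.

```python
import itertools
import os
import tempfile


def iter_lines(path, encoding="utf-8"):
    """Yield lines one at a time, without the trailing newline (hypothetical helper)."""
    with open(path, encoding=encoding) as infile:
        for line in infile:
            yield line.rstrip("\n")


def iter_batches(items, batch_size):
    """Group any iterable into lists of at most batch_size items (hypothetical helper)."""
    iterator = iter(items)
    while True:
        # islice pulls at most batch_size items without consuming the rest.
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def demo():
    """Write a tiny stand-in corpus, stream it in batches of 2, return the batch sizes."""
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", suffix=".txt", delete=False
    ) as tmp:
        tmp.write("first line\nsecond line\nthird line\n")
        path = tmp.name
    try:
        return [len(batch) for batch in iter_batches(iter_lines(path), batch_size=2)]
    finally:
        os.remove(path)
```

With a real corpus you would call `iter_batches(iter_lines(input_file), batch_size)` and process each batch inside the loop. Avoid collecting all batches into a list, since that would bring the whole file back into memory.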


nlp

bigdata

python-3.x

machine-translation

tensorflow