Python Lazy Loading

Question

The following code lazily prints the contents of a text file line by line, with each print stopping at '\n'.

   with open('eggs.txt', 'rb') as file:
       for line in file:
           print line

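Is there any configuration that lazily prints the contents of a text file, with each print stopping at ', '?

(or at any other character/string)

I am asking because I am trying to read a file that contains a single 2.9 GB line separated by commas.

PS. My question is different from this one: Read large text files in Python, line by line without loading it in to memory. I am asking how to stop at a character other than the newline ('\n').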

Answer 1

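I don't think there is a built-in way to do this. You will have to use file.read(block_size) to read the file block by block, split each block at the commas, and manually rejoin the strings that span block boundaries.

Note that you may still run out of memory if no comma is encountered for a long time. (The same problem applies to reading a file line by line when a very long line is encountered.)

Here is an example implementation: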

def split_file(file, sep=",", block_size=16384):
    last_fragment = ""
    while True:
        block = file.read(block_size)
        if not block:
            break
        block_fragments = iter(block.split(sep))
        last_fragment += next(block_fragments)
        for fragment in block_fragments:
            yield last_fragment
            last_fragment = fragment
    yield last_fragment
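
One caveat with a multi-character separator such as ', ' (the one in the question): the separator itself can straddle a block boundary, and splitting each block on its own would then miss it. A minimal sketch of a variant that avoids this, assuming Python 3 and a text-mode file object, keeps the unfinished tail in a buffer and splits the buffer together with the new block. The name split_on and the io.StringIO object standing in for the real file are illustrative assumptions, not part of the implementation above.

```python
import io


def split_on(file, sep=", ", block_size=16384):
    """Yield the pieces of a text-mode file that lie between occurrences of sep."""
    buffer = ""
    while True:
        block = file.read(block_size)
        if not block:
            break
        buffer += block
        # Everything before the last separator is complete. The tail is kept,
        # because the next block may extend it or complete a separator that
        # was cut in half at the block boundary.
        *complete, buffer = buffer.split(sep)
        for piece in complete:
            yield piece
    # Whatever follows the last separator in the file.
    yield buffer


# io.StringIO stands in for the real file. block_size=5 forces the first
# ", " to be cut between two blocks ("spam," and " eggs").
sample = io.StringIO("spam, eggs, ham")
for piece in split_on(sample, sep=", ", block_size=5):
    print(piece)  # prints spam, eggs and ham on separate lines
```

The memory caveat still applies: the buffer grows until a separator is found.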

Answer 2

The following answer can be considered lazy, since it reads the file one character at a time:

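def commaBreak(filename):
    word = ""
    with open(filename) as f:
        while True:
            # read a single character; an empty string means end of file
            char = f.read(1)
            if not char:
                print("End of file")
                # yield whatever follows the last comma
                yield word
                break
            elif char == ',':
                # a comma ends the current word
                yield word
                word = ""
            else:
                word += char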

You could also process more characters per read in the same way, for example 1000 at a time.
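
A rough sketch of that chunked variant, assuming Python 3 and a single-character comma separator. The function name comma_break_chunked, the chunk_size parameter and the temporary demo file are hypothetical and only serve the illustration. Each chunk is cut at its commas with str.partition, and the unfinished word is carried over to the next chunk.

```python
import os
import tempfile


def comma_break_chunked(filename, chunk_size=1000):
    """Yield comma-separated words, reading chunk_size characters per read."""
    word = ""
    with open(filename) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # End of file: yield whatever follows the last comma.
                yield word
                break
            while True:
                # Cut the chunk at its first comma, if it has one.
                head, comma, chunk = chunk.partition(',')
                word += head
                if not comma:
                    # No comma left in this chunk; the word continues
                    # in the next chunk.
                    break
                yield word
                word = ""


# Demo on a small temporary file standing in for the real data.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("a,bb,ccc")
demo_words = list(comma_break_chunked(tmp.name, chunk_size=3))
os.remove(tmp.name)
print(demo_words)  # prints ['a', 'bb', 'ccc']
```

Reading in larger chunks mainly cuts the number of read() calls; the words that come out are the same as with one character per read.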

Answer 3

with open('eggs.txt', 'r') as file:
    for line in file:
        # split each line at ', ' and print the pieces one per line
        words = line.split(', ')
        for word in words:
            print(word)

I'm not quite sure I understand what you are asking. Do you mean something like this?

Answer 4

Read the file using a buffer (Python 3):

filename = 'eggs.txt'  # assumed here: the file name from the question
buffer_size = 2**12
delimiter = ','

with open(filename, 'r') as f:
    # remember the characters after the last delimiter in the previously processed chunk
    remaining = ""

    while True:
        # read the next chunk of characters from the file
        chunk = f.read(buffer_size)

        # end the loop if the end of the file has been reached
        if not chunk:
            break

        # add the remaining characters from the previous chunk,
        # split according to the delimiter, and keep the remaining
        # characters after the last delimiter separately
        *lines, remaining = (remaining + chunk).split(delimiter)

        # print the parts up to each delimiter one by one
        for line in lines:
            print(line, end=delimiter)

    # print the characters after the last delimiter in the file
    if remaining:
        print(remaining, end='')

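Note that, as currently written, it simply prints the contents of the original file unchanged. This is easy to change, for example by changing the end=delimiter argument passed to the print() function in the loop.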

Answer 5

It yields the characters of the file one at a time, so there is no memory overload. As written, it stops at the first comma.

def lazy_read():
    # text mode, so that each item is a str and can be compared with ','
    with open('eggs.txt', 'r') as file:
        item = file.read(1)
        while item:
            if item == ',':
                # stop at the first comma
                return
            yield item
            item = file.read(1)

print(''.join(lazy_read()))

python

lazy-loading