Python Lazy Loading
The following code lazily prints the contents of a text file line by line, with each read stopping at '\n'.
    with open('eggs.txt', 'r') as file:
        for line in file:
            print(line)
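Is there any setting that makes it read the file lazily so that each piece stops at ', ' (or any other character or string) instead?
I ask because I am trying to read a file that consists of a single 2.9 GB line of comma-separated values.
PS: My question is different from this one: "Read large text files in Python, line by line without loading it in to memory". I am asking how to stop at a character other than the newline ('\n').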
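I don't think there is a built-in way to do this. You will have to use file.read(block_size) to read the file block by block, split each block at the commas, and manually rejoin the strings that span block boundaries.
Note that you may still run out of memory if no comma is encountered for a long time. (The same problem applies to reading a file line by line when it contains a very long line.)
Here is an example implementation: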
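    def split_file(file, sep=",", block_size=16384):
        last_fragment = ""
        while True:
            block = file.read(block_size)
            if not block:
                break
            block_fragments = iter(block.split(sep))
            # The first piece of this block continues the unfinished
            # fragment carried over from the previous block.
            last_fragment += next(block_fragments)
            for fragment in block_fragments:
                yield last_fragment
                last_fragment = fragment
        # Whatever follows the final separator is the last field.
        yield last_fragment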
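One caveat about the implementation above: each block is split on its own, so a separator longer than one character (for example ', ', as in the question) would be missed if it happened to straddle a block boundary. The following is a minimal sketch of one way around that, not part of the original answer: it prepends the leftover text to each new block before splitting. The name split_file_multi and the io.StringIO demo input are hypothetical and used only for illustration.

```python
import io

def split_file_multi(file, sep=", ", block_size=16384):
    """Lazily yield sep-delimited fields; sep may be longer than one character."""
    leftover = ""
    while True:
        block = file.read(block_size)
        if not block:
            break
        # Split leftover + block together, so a separator that was cut in
        # half by the previous read is seen whole this time.
        *complete, leftover = (leftover + block).split(sep)
        yield from complete
    # Whatever follows the final separator is the last field.
    yield leftover

# Demo: a tiny block size forces separators to fall across block boundaries.
demo = io.StringIO("spam, eggs, ham")
fields = list(split_file_multi(demo, sep=", ", block_size=5))
print(fields)
```

The same memory caveat applies here: if no separator shows up for a long stretch, leftover keeps growing.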
The following answer can be considered lazy, because it reads the file one character at a time:
    def commaBreak(filename):
        word = ""
        with open(filename) as f:
            while True:
                char = f.read(1)
                if not char:
                    # read() returns an empty string at end of file
                    print("End of file")
                    yield word
                    break
                elif char == ',':
                    yield word
                    word = ""
                else:
                    word += char
You could also choose to read more characters at a time, for example 1000, and process them in the same way.
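As a rough sketch of that suggestion (not the answer author's code), the character loop can stay the same while the file is read 1000 characters at a time. The function name comma_break_chunked, its parameters, and the in-memory demo input are assumptions made for illustration.

```python
import io
from functools import partial

def comma_break_chunked(f, sep=",", chunk_size=1000):
    """Yield sep-delimited words from an open text file, reading in chunks."""
    word = []
    # iter(callable, sentinel) keeps calling f.read(chunk_size) until it
    # returns the empty string, which signals end of file in text mode.
    for chunk in iter(partial(f.read, chunk_size), ""):
        for char in chunk:
            if char == sep:
                yield "".join(word)
                word = []
            else:
                word.append(char)
    # Emit whatever follows the last separator.
    yield "".join(word)

# Demo with a small chunk size so several reads are needed.
words = list(comma_break_chunked(io.StringIO("a,bb,,c"), chunk_size=2))
print(words)
```

Reading in chunks cuts the number of read() calls, while the amount of text held in memory is still bounded by the chunk size plus the current word.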
    # Note: this still reads a whole line at a time before splitting it.
    with open('eggs.txt', 'r') as file:
        for line in file:
            words = line.split(', ')
            for word in words:
                print(word)
I'm not quite sure I understand what you're asking. Do you mean something like this?
Reading the file with a buffer (Python 3):
    filename = 'eggs.txt'  # assumed here; the original snippet leaves filename undefined
    buffer_size = 2**12
    delimiter = ','
    with open(filename, 'r') as f:
        # remember the characters after the last delimiter in the previously processed chunk
        remaining = ""
        while True:
            # read the next chunk of characters from the file
            chunk = f.read(buffer_size)
            # end the loop if the end of the file has been reached
            if not chunk:
                break
            # add the remaining characters from the previous chunk,
            # split according to the delimiter, and keep the remaining
            # characters after the last delimiter separately
            *lines, remaining = (remaining + chunk).split(delimiter)
            # print the parts up to each delimiter one by one
            for line in lines:
                print(line, end=delimiter)
        # print the characters after the last delimiter in the file
        if remaining:
            print(remaining, end='')
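Note that, as currently written, it just prints the contents of the original file unchanged. This is easily changed, for example by changing the end=delimiter argument passed to print() in the loop.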
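To illustrate that last remark, here is a small sketch in which the same buffered loop is wrapped in a function whose end argument is configurable, so each field can be printed on its own line. The function name print_fields, its parameters, and the in-memory demo are hypothetical and not part of the original answer.

```python
import io
import sys

def print_fields(f, delimiter=",", end="\n", buffer_size=2**12, out=sys.stdout):
    """Print each delimiter-separated field of f, followed by end."""
    remaining = ""
    while True:
        chunk = f.read(buffer_size)
        if not chunk:
            break
        # Same splitting step as above: keep the unfinished tail separately.
        *fields, remaining = (remaining + chunk).split(delimiter)
        for field in fields:
            print(field, end=end, file=out)
    # Print the text after the last delimiter, if there is any.
    if remaining:
        print(remaining, end=end, file=out)

# Demo: capture the output instead of writing to the terminal.
captured = io.StringIO()
print_fields(io.StringIO("a,b,c"), buffer_size=2, out=captured)
```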
This yields the characters of the file one at a time, which means there is no memory overload.
    def lazy_read():
        # Text mode, so each item is a str that can be compared with ','
        with open('eggs.txt', 'r') as file:
            item = file.read(1)
            while item:
                if item == ',':
                    # stop the generator at the first comma
                    return
                yield item
                item = file.read(1)

    print(''.join(lazy_read()))
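The same "stop at the first comma" behaviour can also be sketched with standard-library iterators. This is an illustrative alternative, not the answer author's code: the two-argument form of iter() calls f.read(1) until it returns the empty string, and itertools.takewhile stops at the separator. The name read_until and the in-memory demo input are assumptions.

```python
import io
from functools import partial
from itertools import takewhile

def read_until(f, stop=","):
    """Lazily yield characters from f up to (not including) the first stop."""
    # Call f.read(1) repeatedly; the empty string marks end of file.
    chars = iter(partial(f.read, 1), "")
    return takewhile(lambda c: c != stop, chars)

first = "".join(read_until(io.StringIO("spam,eggs")))
print(first)
```

Note that takewhile consumes the separator itself, so calling read_until again on the same file object continues with the next field.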