Python - read from multiple big files in parallel and yield them individually
I have multiple big files and I need to yield them line by line. Pseudocode along these lines:
def get(self):
    with open(file_list, "r") as files:
        for file in files:
            yield file.readline()
How can I do this?
Doing this with context managers would be tricky (or would require some extra libraries), but it shouldn't be too hard without them. open() takes the name of a single file, so assuming files_list is a list of strings naming the input files, this should work:
def get(files_list):
    file_handles = [open(f, "r") for f in files_list]
    while file_handles:
        # Iterate over a copy so that removing an exhausted handle
        # doesn't skip the next one in the same pass.
        for fd in list(file_handles):
            line = fd.readline()
            if line:
                yield line
            else:
                fd.close()
                file_handles.remove(fd)
I'm assuming you want to keep going until every line has been read from every file, with shorter files being dropped once they hit EOF.
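For example, a minimal usage sketch (the file names here are hypothetical placeholders, not from the original question):

# Hypothetical input files; any list of existing text-file paths works.
for line in get(["a.txt", "b.txt", "c.txt"]):
    print(line, end="")  # lines arrive round-robin: a, b, c, a, b, c, ...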
The itertools documentation has several recipes, among them a very neat round-robin recipe. I would also use ExitStack to work with multiple file context managers:
from itertools import cycle, islice
from contextlib import ExitStack

# https://docs.python.org/3.8/library/itertools.html#itertools-recipes
def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))
...
def get(self):
    with open(files_list) as fl:
        filenames = [x.strip() for x in fl]
    with ExitStack() as stack:
        files = [stack.enter_context(open(fname)) for fname in filenames]
        yield from roundrobin(*files)
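As a quick sanity check of the recipe itself (a small demo, not part of the original answer), roundrobin works on any iterables, not just file objects:

# Matches the docstring: 'ABC', 'D', 'EF' --> A D E B F C
print(list(roundrobin("ABC", "D", "EF")))  # ['A', 'D', 'E', 'B', 'F', 'C']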
Although perhaps the best design here is inversion of control: pass the sequence of file objects to .get as an argument, so the calling code takes care of the exit stack:
class Foo:
    ...
    def get(self, files):
        yield from roundrobin(*files)

# calling code:
foo = Foo()  # or however it is initialized
with open(files_list) as fl:
    filenames = [x.strip() for x in fl]
with ExitStack() as stack:
    files = [stack.enter_context(open(fname)) for fname in filenames]
    for line in foo.get(files):
        do_something_with_line(line)
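For completeness, here is a self-contained sketch that wires the pieces together using throwaway temporary files; the tempfile setup and the printed output are assumptions for illustration only, not part of the original answer:

import os
import tempfile
from contextlib import ExitStack

# Create two small demo files (purely for illustration).
paths = []
for name, text in [("a.txt", "a1\na2\n"), ("b.txt", "b1\n")]:
    path = os.path.join(tempfile.mkdtemp(), name)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

foo = Foo()
with ExitStack() as stack:
    files = [stack.enter_context(open(p)) for p in paths]
    for line in foo.get(files):
        print(line, end="")  # prints a1, b1, a2 in round-robin order

Keeping the ExitStack in the caller means the generator never owns the file handles, so they are closed deterministically even if the caller stops iterating early.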