Thread buffer iterator for preloading files in Python
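I have millions of small files, and I want to create a FileLoader class that uses a background thread to preload them into an in-memory file pool to speed things up.
My current solution has no thread buffer: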
from itertools import islice, chain

class FileLoader(list):
    def __init__(self, file_list):
        # a list of file paths
        self.fl = file_list

    def Next(self, size=None):  # get Next size=N file
        if size:  # batch mode
            current_batch = []
            for f in self.fl:
                current_batch.append(open(f).read())
                if len(current_batch) == size:
                    yield current_batch
                    current_batch = []
            if current_batch:
                yield current_batch
        else:  # sequence mode
            for f in self.fl:
                yield open(f).read()

if __name__ == '__main__':
    fl = FileLoader(file_list)
    for fs in fl.Next(5):  # the files should be pooled in memory in advance
        pass  # ... my work....
import multiprocessing

def get_contents(filename):
    with open(filename) as f:
        return f.read()

pool = multiprocessing.Pool(processes=2)  # or more
for fs in pool.imap(get_contents, file_list, 5):  # 5 is the chunk size here
    pass  # ... your work ...
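If you don't care about the order, using imap_unordered may be faster. Experiment with the chunk size and the number of processes. Unlike your draft, this approach yields one file's contents at a time, but you can build batching around it.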
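One way to add that batching is to regroup the stream that imap produces. The following is only a rough sketch: `batches` and `load_batches` are hypothetical names introduced here for illustration, and the default values for `processes` and `chunksize` are just placeholders to tune.

```python
from itertools import islice
import multiprocessing


def get_contents(filename):
    # Same helper as above: read one file and return its text.
    with open(filename) as f:
        return f.read()


def batches(iterable, size):
    # Hypothetical helper: group any iterable into lists of up to `size` items.
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load_batches(file_list, size, processes=2, chunksize=5):
    # Hypothetical wrapper: imap yields contents one at a time, in input order,
    # while the worker processes read files in the background; batches()
    # regroups that stream into lists of `size` contents.
    with multiprocessing.Pool(processes=processes) as pool:
        yield from batches(pool.imap(get_contents, file_list, chunksize), size)
```

With this in place, `for fs in load_batches(file_list, 5):` should hand you lists of up to five file contents, much like `fl.Next(5)` in your draft. On platforms that start worker processes by spawning (Windows, for example), the calling code needs to sit under an `if __name__ == '__main__':` guard.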