Thread buffer iterator for preloading files in Python

I have millions of small files, and I want to create a FileLoader class that uses a background thread to preload them into an in-memory file pool to speed things up.

My current solution for the thread buffer is:

class FileLoader:
    def __init__(self, file_list):
        # a list of file paths
        self.fl = file_list

    def Next(self, size=None):  # get the next size=N files
        if size:  # batch mode
            current_batch = []
            for f in self.fl:
                with open(f) as fh:  # close each file once it has been read
                    current_batch.append(fh.read())
                if len(current_batch) == size:
                    yield current_batch
                    current_batch = []
            if current_batch:  # last, possibly shorter, batch
                yield current_batch

        else:  # sequence mode
            for f in self.fl:
                with open(f) as fh:
                    contents = fh.read()
                yield contents

if __name__ == '__main__':
    fl = FileLoader(file_list)
    for fs in fl.Next(5):  # the files should be pooled in memory in advance
        pass  # ... my work....
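
A multiprocessing.Pool can read the files in background worker processes:

import multiprocessing

def get_contents(filename):
    with open(filename) as f:
        return f.read()

if __name__ == '__main__':  # guard needed where worker processes are spawned
    pool = multiprocessing.Pool(processes=2)  # or more
    for fs in pool.imap(get_contents, file_list, 5):  # 5 is the chunk size here
        pass  # ... your work ...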

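If you don't care about the order, imap_unordered may be faster. Experiment with the chunk size and the number of processes. Unlike your draft, this approach yields the contents of one file at a time, but you can wrap batching around it.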
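
The batching mentioned above could be wrapped around imap roughly as follows. This is a minimal sketch, not a tuned implementation: batched and load_batches are hypothetical helper names, and it uses multiprocessing.pool.ThreadPool (which offers the same imap method as Pool but runs worker threads) so that it runs without a __main__ guard. Swapping in multiprocessing.Pool should work the same way inside such a guard.

```python
from itertools import islice
from multiprocessing.pool import ThreadPool


def get_contents(filename):
    with open(filename) as f:
        return f.read()


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable (hypothetical helper)."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load_batches(file_list, size, workers=2):
    """Yield batches of file contents while worker threads keep reading ahead."""
    with ThreadPool(workers) as pool:
        # imap returns results in input order; the workers continue reading
        # later files while the caller is busy with an earlier batch.
        yield from batched(pool.imap(get_contents, file_list), size)
```

With this in place, a loop such as "for fs in load_batches(file_list, 5):" receives lists of up to five file contents, like Next(5) in the draft. Note that imap does not limit how far ahead the workers read, so for very large inputs memory use is worth watching.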