Multiprocessing inside a for loop

Question

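I have read about the multiprocessing package and the threading module, but I am not sure how to apply them in my case, although I still think I could benefit from using them.

I am currently writing a pipeline that processes and scrapes a bunch of HTML files. My cleaning method iterates over all the HTML files and processes each one by calling another method that extracts the data and returns a pandas DataFrame. The cleaning method currently waits for one file to finish parsing before it starts the next, which is why I think multiprocessing would help here.

I am not sure whether threading or multiprocessing is the right choice, but since the task is CPU-bound, I think multiprocessing should be a good fit.

This is what my code looks like right now: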

def get_clean_df(self):
    # start from an empty dataframe so the first concat has something to extend
    result = pd.DataFrame()

    # iterate through all existing html files and parse them
    for filepath in glob.glob("../data/source/*/*.html"):
        # expand existing dataframe with the newly parsed result
        result = pd.concat([result, self._extract_df_from_html(filepath)])

    return result

Thanks for your help.

Answer 1

In my opinion, you could do something like this:

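import glob
import multiprocessing

import pandas as pd

# Defined at module level so it can be pickled and sent to the worker processes
def extract_df_from_html(filepath):
    # Do stuff here
    df = pd.DataFrame()
    return df

class Foo:
    def process(self):
        files = glob.glob("../data/source/*/*.html")
        # Parse the files in 4 worker processes; map returns results in input order
        with multiprocessing.Pool(4) as pool:
            result = pool.map(extract_df_from_html, files)
        # Combine the per-file dataframes once, at the end
        self.result = pd.concat(result, ignore_index=True)

# The guard keeps worker processes from re-running this block on import
if __name__ == '__main__':
    foo = Foo()
    foo.process()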
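
On the threads-versus-processes part of the question: if you are unsure which fits, `concurrent.futures` gives both kinds of pool the same interface, so you can swap one for the other and time them. Below is a minimal sketch of that idea, not a drop-in replacement for the code above. `count_tags` and `parse_all` are hypothetical names, and `count_tags` is only a stand-in for the real CPU-bound parsing step.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_tags(html):
    # Hypothetical stand-in for the real parser: counts "<" characters.
    # Kept at module level so a process pool can pickle it.
    return html.count("<")


def parse_all(documents, executor_cls=ProcessPoolExecutor, workers=4):
    # Both executor classes accept max_workers and provide map(),
    # which yields results in the same order as the inputs.
    with executor_cls(max_workers=workers) as executor:
        return list(executor.map(count_tags, documents))
```

Passing `executor_cls=ThreadPoolExecutor` runs the same work in threads instead. For CPU-bound pure-Python parsing, the process pool is usually the one that scales across cores, because threads in CPython share one interpreter lock. As in the code above, the call that uses a process pool should sit under the `if __name__ == '__main__':` guard.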

python

multiprocessing

pandas