python 多处理数据帧行

Question

  def main():     
    df_master = read_bb_csv(file)
                p = Pool(2)
                if len(df_master.index) >= 1:
                    for row in df_master.itertuples(index=True, name='Pandas'):
                         p.map((partial(check_option, arg1=row), df_master))
    
    
    def check_option(row):
       get_price(row)

我正在使用 Pandas 读取 CSV，遍历行并处理信息。给出函数 get_price() 需要进行多个 http 调用，我想使用多进程一次处理所有行（取决于 CPU 核心）以加速函数。

我遇到的问题是，我是多进程的新手，不知道如何使用 p.map((check_option, arg1=row), df_master) 处理数据帧中的所有行。没有必要将 return 的 row 值返回给函数。只需要允许行到进程来处理。

感谢您的帮助。

Answer 1

您可以使用我随处使用的以下 python3 版本，它非常有用！还有一个 python3 包 mpire，我发现它非常有用，用法类似于 python3 的多处理包。

from multiprocessing import Pool
import pandas as pd

def get_price(idx, row):
    # logic to fetch price
    return idx, price

def main():
    df = pd.read_csv("path to file")
    NUM_OF_WORKERS = 2 
    pool = Pool(NUM_OF_WORKERS)
    results = [pool.apply_async(get_price, [idx, row]) for idx, row in df.iterrows()]
    for result in results:
        idx, price = result.get()
        df.loc[idx, 'Price'] = price
    # do whatever you want to do with df, save it to same file.

if __name__ == "__main__":
    # don't forget to call main func as module
    # This is must in windows use multiple processes/threads. It's also a good practice, more info on this page https://docs.python.org/3/library/multiprocessing.html#multiprocessing-programming
    main()

python 多处理数据帧行

python multiprocessing dataframe rows

python

csv

multiprocessing

pandas