Multiprocessing.pool with a function that has multiple args and kwargs
I want to use the multiprocessing.Pool method to run calculations in parallel. The problem is that the function I want to use takes two arguments plus optional kwargs: the first argument is a dataframe, the second is a string, and the kwargs are a dictionary.
The dataframe and the dictionary are the same for every calculation I want to run; only the second argument keeps changing. I was therefore hoping to pass a list of different strings to the map method, with the function already packed with the df and the dict.
from utils import *
import multiprocessing
import operator      # used below; possibly re-exported by utils
import pandas as pd  # used below; possibly re-exported by utils
from functools import partial

def sumifs(df, result_col, **kwargs):
    compare_cols = list(kwargs.keys())
    operators = {}
    for col in compare_cols:
        if type(kwargs[col]) == tuple:
            operators[col] = kwargs[col][0]
            kwargs[col] = list(kwargs[col][1])
        else:
            operators[col] = operator.eq
            kwargs[col] = list(kwargs[col])
    result = []
    cache = {}
    # Go through each value
    for i in range(len(kwargs[compare_cols[0]])):
        compare_values = [kwargs[col][i] for col in compare_cols]
        cache_key = ','.join([str(s) for s in compare_values])
        if cache_key in cache:
            entry = cache[cache_key]
        else:
            df_copy = df.copy()
            for compare_col, compare_value in zip(compare_cols, compare_values):
                df_copy = df_copy.loc[operators[compare_col](df_copy[compare_col], compare_value)]
            entry = df_copy[result_col].sum()
            cache[cache_key] = entry
        result.append(entry)
    return pd.Series(result)
if __name__ == '__main__':
    ca = read_in_table('Tab1')
    total_consumer_ids = len(ca)

    base = pd.DataFrame()
    base['ID'] = range(1, total_consumer_ids + 1)

    result_col = ['A', 'B', 'C']
    keywords = {'Z': base['Consumer archetype ID']}

    max_number_processes = multiprocessing.cpu_count()
    with multiprocessing.Pool(processes=max_number_processes) as pool:
        results = pool.map(partial(sumifs, a=ca, kwargs=keywords), result_col)
    print(results)
However, when I run the code above I get the following error: TypeError: sumifs() missing 1 required positional argument: 'result_col'.
How can I provide the function with the first arg and the kwargs, while supplying the second argument as a list of strings, so that the calculations run in parallel? I have read several similar questions on the forums, but none of the solutions seem to work for this case...
Thank you, and apologies if anything is unclear; I only learned about the multiprocessing package today!
Let's look at two parts of your code.

First, the sumifs function declaration:

def sumifs(df, result_col, **kwargs):

Second, the call to this function with the relevant parameters:
# Those are the params
ca = read_in_table('Tab1')
keywords = {'Z': base['Consumer archetype ID']}
# This is the function call
results = pool.map(partial(sumifs, a=ca, kwargs=keywords), tasks)
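Here sumifs has no parameter named a or kwargs, so both keyword arguments get swallowed by **kwargs, each string supplied by map binds positionally to df, and result_col is never filled in, which is exactly the TypeError you see. A minimal sketch reproducing the mis-binding (the stub body is illustrative):

from functools import partial

def sumifs(df, result_col, **kwargs):
    return df, result_col, kwargs

bound = partial(sumifs, a='the df', kwargs={'Z': [1, 2]})
bound('A')
# TypeError: sumifs() missing 1 required positional argument: 'result_col'
# 'A' binds to df; a=... and kwargs=... are captured by **kwargs,
# so nothing is left over to fill result_col.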
Update 1:

After looking at the edited original code, it seems the problem is the keyword-style argument assignment in partial; try dropping it.

Replace the line:
results = pool.map(partial(sumifs, a=ca, kwargs=keywords), result_col)
with:
results = pool.map(partial(sumifs, ca, **keywords), result_col)
Now ca binds positionally to df, each element of result_col supplied by map fills the next positional parameter, result_col, and keywords is unpacked so its entries arrive as **kwargs. Sample code:
import multiprocessing
from functools import partial

def test_func(arg1, arg2, **kwargs):
    print(arg1)
    print(arg2)
    print(kwargs)
    return arg2

if __name__ == '__main__':
    list_of_args2 = [1, 2, 3]
    just_a_dict = {'key1': 'Some value'}
    with multiprocessing.Pool(processes=3) as pool:
        results = pool.map(partial(test_func, 'This is arg1', **just_a_dict), list_of_args2)
    print(results)
Will output (print order may vary between runs, since the workers run concurrently):

This is arg1
1
{'key1': 'Some value'}
This is arg1
2
{'key1': 'Some value'}
This is arg1
3
{'key1': 'Some value'}
[1, 2, 3]
Update 2:

Extended example (in response to a comment):
I wonder, however, in the same fashion, if my function had three args and kwargs, and I wanted to keep arg1, arg3 and kwargs constant, how could I pass arg2 as a list for multiprocessing? In essence, how will I indicate to multiprocessing that in map(partial(test_func, 'This is arg1', 'This would be arg3', **just_a_dict), arg2) the second value in partial corresponds to arg3 and not arg2?
The code from Update 1 would change as follows:
# The function signature
def test_func(arg1, arg2, arg3, **kwargs):
# The map call
pool.map(partial(test_func, 'This is arg1', arg3='This is arg3', **just_a_dict), list_of_args2)
This works thanks to Python's positional and keyword argument assignment. Note that kwargs stands apart and is not assigned by keyword, even though it comes after a keyword-assigned value.
More information about the differences in argument assignment can be found here.
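A minimal sketch of how the binding resolves (same illustrative names as above): partial stores 'This is arg1' positionally and arg3 plus the unpacked dict by keyword, so the single value supplied by map falls into the only remaining positional slot, arg2.

from functools import partial

def test_func(arg1, arg2, arg3, **kwargs):
    return (arg1, arg2, arg3, kwargs)

just_a_dict = {'key1': 'Some value'}
bound = partial(test_func, 'This is arg1', arg3='This is arg3', **just_a_dict)

# The one positional argument passed here lands in arg2, because
# arg1 is already bound positionally and arg3 by keyword.
print(bound(42))
# ('This is arg1', 42, 'This is arg3', {'key1': 'Some value'})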
If a piece of data is constant/fixed across all works/jobs, it is better to "initialize" the pool's processes with that fixed data at pool-creation time and map over the varying data. This avoids re-sending the fixed data on every job request. In your case, I would do something like the following:
df = None
kw = {}

def initialize(df_in, kw_in):
    # Runs once in each worker process: stash the fixed data in globals.
    global df, kw
    df, kw = df_in, kw_in

def worker(data):
    # computation involving df, kw, and data
    ...

...

with multiprocessing.Pool(max_number_processes, initialize, (ca, keywords)) as pool:
    pool.map(worker, varying_data)
This gist contains a full-blown example of using the initializer. This blog post explains the performance gains from using an initializer.
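As a self-contained illustration of the same pattern (all names here are made up for the demo; substitute your real dataframe, keywords, and computation):

import multiprocessing

# Globals that each worker process fills in once via the initializer.
fixed_df = None
fixed_kw = {}

def initialize(df_in, kw_in):
    global fixed_df, fixed_kw
    fixed_df, fixed_kw = df_in, kw_in

def worker(col):
    # The varying item (col) combines with the fixed data that was
    # sent once at pool creation instead of once per job.
    return f'{col}: {fixed_df} / {fixed_kw}'

if __name__ == '__main__':
    shared_df = 'pretend-dataframe'
    shared_kw = {'Z': [1, 2, 3]}
    with multiprocessing.Pool(2, initialize, (shared_df, shared_kw)) as pool:
        print(pool.map(worker, ['A', 'B', 'C']))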