Parallelizing code across CPU cores that iterates over a nested dictionary of 700K total entries

I have the following code:

for key in test_large_images.keys():
    test_large_images[key]['avg_prob'] = 0
    sum = 0
    for value in test_large_images[key]['pred_probability']:
        print(test_large_images[key]['pred'])
        print(type(test_large_images[key]['pred'] ))
        if test_large_images[key]['pred'] == 1:
            sum += value
    test_large_images[key]['avg_prob'] = sum/len(test_large_images[key]['pred_probability'])

It is a dictionary of 359 large images, each of which can contain anywhere from 200 to 8000 smaller images, which I call patches. test_large_images holds the inference results for those smaller images; it also contains each patch's predicted probability, the large-image name, the patch name, and so on. My goal is to compute the average probability of each large image from the predicted probabilities of the smaller patches inside it. When I ran the loop on a smaller dataset (45K patches) whose inference results I had saved in a pkl file, it ran very fast. However, this script has now been running for over 130 minutes, as you can see, in a Jupyter Notebook over VSCode Remote (with a local client on a Mac).

Is there any way I can use the 24 CPU cores I have to speed up this nested-dictionary computation?

  1. Don't use sum as a variable name, since it shadows the built-in function.
  2. The line test_large_images[key]['avg_prob'] = 0 is unnecessary.
  3. PeterK is right: your condition does not need to be re-evaluated on every iteration of the inner for loop.
  4. Why are we printing these repeatedly, or is it just for testing? Printing two lines per value over 700K entries in a notebook likely dominates the runtime.
for key in test_large_images.keys():
    add = 0
    condition = test_large_images[key]['pred'] == 1 # This is what PeterK means by take it out (of the loop).
    for value in test_large_images[key]['pred_probability']:
        # print(test_large_images[key]['pred'])
        # print(type(test_large_images[key]['pred']))
        if condition:
            add += value
    test_large_images[key]['avg_prob'] = add/len(test_large_images[key]['pred_probability'])

Your code can be simplified to:

for key in test_large_images.keys():
    condition = test_large_images[key]['pred'] == 1
    num = sum(x for x in test_large_images[key]['pred_probability'] if condition)
    denom = len(test_large_images[key]['pred_probability'])
    test_large_images[key]['avg_prob'] = num/denom
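To sanity-check the simplification end to end, here is a minimal runnable sketch on a hypothetical two-image dict (the keys and values are made up, but the shape matches the question's description):

```python
# Hypothetical sample data in the same shape as test_large_images.
test_large_images = {
    'img_a': {'pred': 1, 'pred_probability': [0.25, 0.5, 0.75]},
    'img_b': {'pred': 0, 'pred_probability': [0.9, 0.1]},
}

for key in test_large_images.keys():
    # Hoist the per-key condition out of the sum, as suggested above.
    condition = test_large_images[key]['pred'] == 1
    num = sum(x for x in test_large_images[key]['pred_probability'] if condition)
    denom = len(test_large_images[key]['pred_probability'])
    test_large_images[key]['avg_prob'] = num/denom

print(test_large_images['img_a']['avg_prob'])  # 0.5
print(test_large_images['img_b']['avg_prob'])  # 0.0  ('pred' != 1, so num is 0)
```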

Based on the feedback and some additional optimization:

for key in test_large_images.keys():
    if test_large_images[key]['pred'] != 1:
        test_large_images[key]['avg_prob'] = 0
        continue
    values = test_large_images[key]['pred_probability']
    test_large_images[key]['avg_prob'] = sum(values)/len(values)

These are two different kinds of average (I'm most interested in the average of the probabilities taken over only the entries predicted as 1). I call it avg_prob_pos:

for key in progress_bar(test_large_images.keys()):
    # Here 'pred' is the list of per-patch predictions (it is used with
    # .count(1) below), so the filter must be applied per element rather
    # than once per key.
    preds = test_large_images[key]['pred']
    probs = test_large_images[key]['pred_probability']
    # Sum only the probabilities of patches predicted as class 1.
    num = sum(p for p, pred in zip(probs, preds) if pred == 1)
    count = preds.count(1)
    if count != 0:
        test_large_images[key]['avg_prob_pos'] = num/count
    test_large_images[key]['avg_prob'] = num/len(probs)

    percentage = count/len(preds)
    test_large_images[key]['percentage'] = percentage