我应该如何提高 Python/Cython 性能？ Parallelization/memoryviews/numpy？

Question

我的任务：获取 3 个整数列表，每个列表都有一些乘数，看看是否可以重新排列元素以构成两个列表（具有更大的乘数）。

我有执行此操作的代码 - 遍历我的整个数据集，大约需要 15 秒：（编辑：修复错误）

%%cython
cdef bint my_check(
    list pattern1, 
    list pattern2, 
    list pattern3, 
    int amount1,
    int amount2,
    int amount3
):

    cdef dict all_items = dict()
    cdef int i, total_amount = amount1 + amount2 + amount3, m1, m2
    cdef bint bad_split = False

    # Pool the items together.

    for i in range(len(pattern1)):
        all_items[pattern1[i]] = all_items.get(pattern1[i],0) + amount1
    for i in range(len(pattern2)):
        all_items[pattern2[i]] = all_items.get(pattern2[i],0) + amount2
    for i in range(len(pattern3)):
        all_items[pattern3[i]] = all_items.get(pattern3[i],0) + amount3

    # Iterate through possible split points:
    for m1 in range(total_amount//2, total_amount):
        m2 = total_amount - m1

        # Split items into those with quantities divisible at this point and those without
        divisible = {i:all_items[i] for i in all_items if all_items[i]%m1 == 0}
        not_divisible = {i:all_items[i] for i in all_items if all_items[i]%m1 != 0}

        # Check that all of the element amounts that are not divisible by m1 are divisible by m2.
        for i in not_divisible:
            if not_divisible[i]%m2 != 0:
                bad_split = True
                break
        # If there is an element that doesn't divide by either, try the next split value.
        if bad_split:
            continue
        
        items1 = {i:divisible[i]//m1 for i in divisible}
        items2 = {i:not_divisible[i]//m2 for i in not_divisible}

        if <some other stuff here>:
            return True

    # Tried all of the split points
    return False

那么如果这个return为真，我运行另一个函数来做组合。在我的数据集中，my_check() 函数被调用 > 150,000 次（并且占用大部分时间）而另一个函数被调用 < 500 次，所以我不太关心优化那个函数。

我想将其并行化以提高性能，但我发现：

我的第一个想法是使用 numpy 函数来利用向量化，通过将 all_items 转换为 numpy 数组，使用 np.mod() 和 np.logical_not() 拆分项目，以及其他numpy 在最后一个 if 子句中运行，但与使用 dict comprehension
如果我将 m1 范围切换为 Cython prange，编译器会抱怨在没有 GIL 的情况下使用 Python 对象。我将字典切换为 cdef'd numpy 数组，但速度更慢。我尝试使用 memoryviews，但它们似乎不容易操作？我在这里读到另一个问题，切片不能分配给变量，所以我不知道如何使用它们。它不会让我在 for 循环中 cdef 新变量。

因为我运行宁在 m1 的不同值，并且一旦任何 return True，它应该是可并行化的，而不用担心竞争条件。

我的方法应该是什么？麻木的？赛通？还有别的吗？

我很高兴 post 从我的任何尝试中得到更详细的错误，但我认为 post 将它们全部处理起来会让人不知所措。我无法为此工作进行分析或行分析 - 我已经将相关的 # cython: 语句添加到 Jupyter notebook 单元格的顶部，但是当我运行它。

编辑：根据@DavidW 的回答，我用以下代码替换了中间的代码块，这将时间缩短了一半：

items1 = dict()
items2 = dict()
bad_split = False

for k,v in all_items.items():
    if v % m1 == 0:
        items1[k] = v//m1
    elif v % m2 == 0:
        items2[k] = v//m2
    else:
        bad_split = True
        break

如果可能的话，我仍然想找到一些利用我的多核处理器的方法。

Answer 1

您肯定可以对循环进行一些改进，这些改进不会改变基本方法，但可能会更快。我没有为这些计时，所以值得这样做而不是相信我的话。

for i in range(len(pattern1)):
    all_items[pattern1[i] = all_items.get(pattern1[i],0) + amount1

（忽略语法错误）。按项目迭代而不是在范围内迭代通常更理想化，并且它避免了两次查找（有时在 Cython 中不是这样，例如迭代 numpy 数组，但对于列表可能是这样）：

for pattern1_i in pattern1:
    all_items[pattern1_i] = all_items.get(pattern1_i,0) + amount1

更重要的是你有两个循环：

divisible = {i:all_items[i] for i in all_items if all_items[i]//m1 == 0}
not_divisible = {i:all_items[i] for i in all_items if all_items[i]//m1 != 0}

当您可以直接遍历键和值时，您正在浪费大量时间进行字典查找。例如

divisible = {k: v for k, v in all_items.items() if v//m1 == 0}

但是您还要遍历字典两次并执行相同的测试两次。

divisible = {}
not_divisible = {}
for k, v in all_items.items():
   if v//m1 == 0:
     divisible[k] = v
   else:
     not_divisible[k] = v

很可能将您的算法转换为涉及 Numpy 数组的东西，但这是一个相当重大的变化，超出了我的兴趣范围。

附录： 这些天我越来越不愿意推荐人们在 Cython 中使用 C++ 类。主要是因为 a) 它通常会导致非常笨拙的代码，b) 人们倾向于以货物文化的方式使用它，因为“它是 C++，所以它必须比 Python 对象更快，并且 c) 人们往往会忘记关于在每个函数的开头和结尾转换对象 to/from C++ 的成本。

但是，在这种情况下，它实际上可能是一个不错的选择，因为您的 dict 对象是统一类型的，并且完全包含在一个函数中。密钥替换为 dict -> unordered_map.

你想做的（粗略）是

from libcpp.unordered_map cimport unordered_map

然后键入 all_items、items1 和 items2 作为 cdef unordered_map[int, int（我认为...）。您在循环外执行此输入。其余代码基本保持不变（您可能需要找到 dict.get... 的替代品）。

一旦你让它作为一个串行计算工作，你应该能够将你的 for m1 in range(total_amount//2, total_amount): 变成一个 prange 循环，假设所有内容都正确输入，那么这应该可以并行工作。显然if <some other stuff here>是一个很大的未知数。

您必须在循环期间将all_items视为严格只读以避免竞争条件。但是，我希望 Cython 应该将 items1 和 items2 正确识别为循环局部变量。

这里有一个相当可以用作起点。 对于未来的读者：请三思是否真的需要将所有Python对象转换为C++对象；你可能不会

我应该如何提高 Python/Cython 性能？ Parallelization/memoryviews/numpy？

How should I improve Python/Cython performance? Parallelization/memoryviews/numpy?

python

multithreading

numpy

vectorization

cython