Performance of sorted() and heapq functions in Python3

Question

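I would like to implement the following procedure in the fastest way using Python3: given a list of N random integers, I need to return the K smallest ones (and I do not need the returned integers to be sorted). I implemented it in three different ways (as you can see in the code below).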

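The test_sorted() function uses the built-in sorted() function to sort the whole list of integers and then takes a slice of the first K elements. The cost of this operation should essentially be the cost of running sorted(), which has a time complexity of O(N log(N)).
The test_heap() function uses a heap to store only the K lowest elements and returns them. Inserting an element into a heap has a time complexity of O(log(N)), and in theory we need to push an element onto the heap N times. However, after the first K insertions we push and pop from the heap, and I was hoping that if an incoming element is greater than every element in the heap, the insertion would not happen, so the time complexity should be somewhere between O(K log(N)) and O(N log(N)) (depending on the actual ordering of the input list). Anyway, even if my assumption is incorrect, the worst-case complexity should be O(N log(N)) (as usual, I assumed that the cost of all the comparisons we need is negligible).
The test_nsmallest() function uses the nsmallest() function from the heapq module. I had no expectations about this approach, since in the official Python documentation I only found that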

For larger values, it is more efficient to use the sorted() function. I decided to give it a try.

# test.py

from heapq import heappush, heappushpop, nsmallest
from random import randint
from timeit import timeit

N, K = 1000, 50
RANDOM_INTS = [randint(1,100) for _ in range(N)]

def test_sorted():
    return sorted(RANDOM_INTS)[:K]

def test_heap():
    heap = []
    for val in RANDOM_INTS:
        if len(heap) < K:
            heappush(heap, -val)
        else:
            heappushpop(heap, -val)
    return [-val for val in heap]

def test_nsmallest():
    return nsmallest(K, RANDOM_INTS)


def main():
    sorted_result = timeit("test_sorted()", globals=globals(), number=100_000)
    print(f"test_sorted took: {sorted_result}")

    heap_result = timeit("test_heap()", globals=globals(), number=100_000)
    print(f"test_heap took: {heap_result}")

    nsmallest_result = timeit("test_nsmallest()", globals=globals(), number=100_000)
    print(f"test_nsmallest took: {nsmallest_result}")

    r1, r2, r3 = test_sorted(), test_heap(), test_nsmallest()
    assert len(r1) == len(r2) == len(r3)
    assert set(r1) == set(r2) == set(r3)


if __name__ == '__main__':
    main()

The output on my (old) late-2011 MacBook Pro with a 2.4GHz i7 processor is the following.

$ python --version
Python 3.9.2

$ python test.py 
test_sorted took: 8.389572635999999
test_heap took: 18.586762750000002
test_nsmallest took: 13.772040639000004

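The simplest solution using sorted() is by far the best. Can anyone elaborate on why the outcome does not match my expectation (i.e., that the test_heap() function should be at least a bit faster)? What am I missing?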

If I run the same code with pypy, the result is the opposite.

$ pypy --version
Python 3.7.10 (51efa818fd9b, Apr 04 2021, 12:03:51)
[PyPy 7.3.4 with GCC Apple LLVM 12.0.0 (clang-1200.0.32.29)]

$ pypy test.py 
test_sorted took: 7.1336525249998886
test_heap took: 3.1177806880004937
test_nsmallest took: 7.5453417899998385

This is closer to what I expected.

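Provided that I know nothing about the python internals and I only have a very rough understanding of why pypy is faster than python, can anyone elaborate on those results and add some information about what is going on in order to allow me to correctly foresee the best choice for similar situations in the future?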

Also, if you have any suggestions about other implementations that run faster than the ones above, please feel free to share!

UPDATE:

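What if we need to sort the input list according to some criterion that is not the value of the items themselves (as I actually need in my real use case; the above is just a simplification)? Well, in this case the outcome is even more surprising: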

# test2.py

from heapq import heappush, heappushpop, nsmallest
from random import randint
from timeit import timeit


N, K = 1000, 50
RANDOM_INTS = [randint(1,100) for _ in range(N)]


def test_sorted():
    return sorted(RANDOM_INTS, key=lambda x: x)[:K]

def test_heap():
    heap = []
    for val in RANDOM_INTS:
        if len(heap) < K:
            heappush(heap, (-val, val))
        else:
            heappushpop(heap, (-val, val))
    return [val for _, val in heap]

def test_nsmallest():
    return nsmallest(K, RANDOM_INTS, key=lambda x: x)


def main():
    sorted_result = timeit("test_sorted()", globals=globals(), number=100_000)
    print(f"test_sorted took: {sorted_result}")

    heap_result = timeit("test_heap()", globals=globals(), number=100_000)
    print(f"test_heap took: {heap_result}")

    nsmallest_result = timeit("test_nsmallest()", globals=globals(), number=100_000)
    print(f"test_nsmallest took: {nsmallest_result}")

    r1, r2, r3 = test_sorted(), test_heap(), test_nsmallest()
    assert len(r1) == len(r2) == len(r3)
    assert set(r1) == set(r2) == set(r3)


if __name__ == '__main__':
    main()

Output:

$ python test2.py 
test_sorted took: 18.740868524
test_heap took: 27.694126547999996
test_nsmallest took: 25.414596833000004

$ pypy test2.py 
test_sorted took: 65.88409741500072
test_heap took: 3.9442632220016094
test_nsmallest took: 19.981832798999676

This tells me at least two things:

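Sorting by an external key is very expensive, both when you supply a lambda function through the key kwarg and when you need to build (sorting_value, actual_value) tuples to get the desired ordering in the heap.
Using lambdas with pypy seems to be extremely expensive, but I don't know why... maybe pypy cannot optimize them, and this does not play well with the other optimizations it performs?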

Answer 1

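You are sorting a small array using the CPython interpreter and the PyPy just-in-time compiler. As a result, many complex overheads come into play. Built-in calls can be faster than hand-written pure-Python code containing loops.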

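Asymptotic complexity is only meaningful for large values because it drops the constant factors: an O(n log2(n) + 30 n) algorithm can be slower in practice than an O(2 n log2(n)) algorithm for n < 1 000 000 000, even though both are O(n log2(n))... The actual factors are hard to know, since many important hardware effects have to be taken into account.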
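To make the constant-factor point concrete, here is a small sketch that treats the two expressions above as literal operation counts. That is an assumption made purely for illustration (real running times depend on the hardware effects just mentioned), and the names cost_a and cost_b are hypothetical.

```python
import math


def cost_a(n):
    """Hypothetical operation count n*log2(n) + 30*n."""
    return n * math.log2(n) + 30 * n


def cost_b(n):
    """Hypothetical operation count 2*n*log2(n)."""
    return 2 * n * math.log2(n)


# Compare the two counts below, at, and above the crossover point.
for n in (1_000, 1_000_000, 2**30, 2**31):
    a, b = cost_a(n), cost_b(n)
    if math.isclose(a, b):
        verdict = "tie"
    elif a > b:
        verdict = "A costs more"
    else:
        verdict = "B costs more"
    print(f"n={n:>13,}: A={a:.3e}  B={b:.3e}  -> {verdict}")
```

The two counts meet where log2(n) = 30, i.e. n = 2**30 (about 1.07 billion), which is where the threshold quoted above comes from.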

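Moreover, for the Heapsort, all the items must be inserted into the heap to get the right result (any item left out could be one of the smallest). This can be done in O(n) time. Thus, to get the first k values of an n-sized list, the complexity is O(k log(n) + n) (without taking hidden constants into account).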
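The O(k log(n) + n) bound corresponds to building the heap in a single pass and then popping k times. A minimal sketch, assuming a plain list of comparable items; the helper name k_smallest_heapify is hypothetical and not part of the question's code:

```python
import heapq


def k_smallest_heapify(values, k):
    """Return the k smallest items using heapify plus k pops."""
    heap = list(values)      # copy so the caller's list is left untouched
    heapq.heapify(heap)      # builds a min-heap in O(n)
    # Each heappop costs O(log n), so the k pops cost O(k log n) in total.
    return [heapq.heappop(heap) for _ in range(min(k, len(heap)))]


print(k_smallest_heapify([5, 1, 4, 2, 3], 2))
```

As a side effect the result comes out in ascending order, which the question does not require. This is still pure-Python glue around built-in calls, so the constant factors discussed above apply here too.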

The simplest solution using sorted() is by far the best, can anyone elaborate on why the outcome does not match my expectation (i.e., that the test_heap() function should be at least a bit faster)?

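sorted is a very well optimized built-in function. Python uses the very fast Timsort algorithm. Timsort is generally faster than a naive Heapsort. This is why it is faster than nsmallest despite its higher complexity. Moreover, your Heapsort is written in pure Python.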

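Additionally, in CPython, most of the time in the three implementations is the overhead of dealing with the sorted list and creating a new one (about half the time on my machine). PyPy can mitigate these overheads but cannot remove them completely. Keep in mind that a Python list is a complex dynamic object with many memory indirections (required to store dynamically typed objects in it).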

Provided that I know nothing about the python internals and I only have a very rough understanding of why pypy is faster than python, can anyone elaborate on those results and add some information about what is going on in order to allow me to correctly foresee the best choice for similar situations in the future?

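The best solution is not to use Python lists when you can safely say that all the items are of native types: fixed-size integers, single/double-precision floating-point numbers. Use Numpy instead! However, keep in mind that Numpy/list conversions are quite slow.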

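Here, the fastest solution is to create a Numpy array of random integers directly using np.random.randint(0, 100, N) and then use a partitioning algorithm to retrieve the k smallest numbers using np.partition(data, k)[:k]. You can sort the resulting k-sized array if needed. Note that using a heap is one way to perform the partitioning, but it is far from being the fastest algorithm (see QuickSelect for example). Finally, please note that there are fast O(n) sorting algorithms for integers, like RadixSort.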
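The Numpy approach described above can be sketched as follows. It assumes NumPy is installed and that k is smaller than the array length; the helper name k_smallest_partition is hypothetical. Note that the upper bound of np.random.randint is exclusive, so this draws values from 0 to 99 rather than the 1 to 100 used in the question.

```python
import numpy as np

N, K = 1000, 50
# Create the data directly as a Numpy array to avoid a slow list conversion.
data = np.random.randint(0, 100, N)


def k_smallest_partition(arr, k):
    """Return the k smallest values of arr, in no particular order."""
    # np.partition puts the element of sorted rank k at index k and every
    # smaller-or-equal element before it, so the first k entries are the
    # k smallest values (unsorted).
    return np.partition(arr, k)[:k]


smallest = k_smallest_partition(data, K)
print(np.sort(smallest))  # sort only the k-sized result, if order matters
```

For a fair timing comparison with the list-based versions, the array should be created outside the timed function, just as RANDOM_INTS is in the question.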

Using lambdas with pypy seems to be extremely expensive, but I don't know why...

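AFAIK, this case is a performance issue in PyPy (due to internal guards). The team is aware of it and plans to improve the performance of such cases in the future. A general rule of thumb is to avoid dynamic code as much as possible to get fast execution (e.g., pure-Python objects like lists and dicts, as well as lambdas).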
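One way to apply this rule of thumb to the keyed variant from the question is to compute the sort keys once up front and sort plain tuples, so that no Python-level key function is called during the sort. This is only a sketch: the (key, payload) record layout and the helper name k_smallest_by_key are hypothetical, and whether it is actually faster on CPython or PyPy has to be measured.

```python
from random import randint

# Hypothetical records: (sorting_value, actual_value) pairs.
records = [(randint(1, 100), f"item{i}") for i in range(1000)]


def k_smallest_by_key(items, k):
    """Return the k records with the smallest sorting_value, without key=."""
    # Decorate each record as (sorting_value, index). The index breaks ties,
    # so the payloads themselves are never compared.
    decorated = sorted((key, i) for i, (key, _) in enumerate(items))
    # Undecorate: map the first k indices back to the original records.
    return [items[i] for _, i in decorated[:k]]


print(k_smallest_by_key(records, 5))
```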
