为什么 Python 列表在排序时变慢?

Why is Python list slower when sorted?

在下面的代码中,我创建了两个具有相同值的列表:一个列表未排序 (s_not),另一个列表已排序 (s_yes)。这些值由 randint() 创建。我 运行 对每个列表进行一些循环并计时。

import random
import time

for x in range(1,9):

    r = 10**x # do different val for the bound in randint()
    m = int(r/2)

    print("For rand", r)

    # s_not is non sorted list
    s_not = [random.randint(1,r) for i in range(10**7)]

    # s_yes is sorted
    s_yes = sorted(s_not)

    # do some loop over the sorted list
    start = time.time()
    for i in s_yes:
        if i > m:
            _ = 1
        else:
            _ = 1
    end = time.time()
    print("yes", end-start)

    # do the same to the unsorted list
    start = time.time()
    for i in s_not:
        if i > m:
            _ = 1
        else:
            _ = 1
    end = time.time()
    print("not", end-start)

    print()

输出:

For rand 10
yes 1.0437555313110352
not 1.1074268817901611

For rand 100
yes 1.0802974700927734
not 1.1524150371551514

For rand 1000
yes 2.5082249641418457
not 1.129960298538208

For rand 10000
yes 3.145440101623535
not 1.1366300582885742

For rand 100000
yes 3.313387393951416
not 1.1393756866455078

For rand 1000000
yes 3.3180911540985107
not 1.1336982250213623

For rand 10000000
yes 3.3231537342071533
not 1.13503098487854

For rand 100000000
yes 3.311596393585205
not 1.1345293521881104

因此,当增加 randint() 中的界限时,排序列表的循环会变慢。为什么?

缓存未命中。当 N int 对象被背靠背分配时,为保存它们而保留的内存往往位于连续的块中。因此,按分配顺序遍历列表也往往会以顺序、连续、递增的顺序访问保存整数值的内存。

打乱它,并且在列表中爬行时的访问模式也是随机的。缓存未命中比比皆是,前提是有足够多的不同 int 对象,它们并不都适合缓存。

r==1r==2,CPython 恰好将这种小整数视为单例,因此,例如,尽管列表中有 1000 万个元素,但在 r==2它仅包含(最多)100 个不同的 int 对象。同时适合缓存的所有数据。

除此之外,您可能会获得越来越多、越来越独特的 int 对象。当访问模式是随机的时,硬件缓存变得越来越无用。

说明:

>>> from random import randint, seed
>>> seed(987987987)
>>> for x in range(1, 9):
...     r = 10 ** x
...     js = [randint(1, r) for _ in range(10_000_000)]
...     unique = set(map(id, js))
...     print(f"{r:12,} {len(unique):12,}")
...     
          10           10
         100          100
       1,000    7,440,909
      10,000    9,744,400
     100,000    9,974,838
   1,000,000    9,997,739
  10,000,000    9,999,908
 100,000,000    9,999,998

答案很可能是数据的局部性。超过一定大小限制的整数是动态分配的。创建列表时,整数对象是从(大部分)附近的内存中分配的。所以当你遍历列表时,东西往往在缓存中,硬件预取器可以把它们放在那里。

在排序的情况下,对象会被打乱,导致更多缓存未命中。

正如其他人所说,缓存未命中。不是 values/sortedness。相同的排序值,但使用新的顺序创建的对象,再次快速(实际上甚至比 not 情况快一点):

s_new = [--x for x in s_yes]

只需选择一种尺码:

For rand 1000000
yes 3.6270992755889893
not 1.198620080947876
new 1.02010178565979

查看一个元素与下一个元素的地址差异(仅 106 个元素)表明,特别是对于 s_new,这些元素在内存中按顺序很好地排列( 99.2% 的时间下一个元素出现在 32 字节之后),而对于 s_yes 他们完全不是(只有 0.01% 出现在 32 字节之后):

s_yes:
    741022 different address differences occurred. Top 5:
    Address difference 32 occurred 102 times.
    Address difference 0 occurred 90 times.
    Address difference 64 occurred 37 times.
    Address difference 96 occurred 17 times.
    Address difference 128 occurred 9 times.

s_not:
    1048 different address differences occurred. Top 5:
    Address difference 32 occurred 906649 times.
    Address difference 96 occurred 8931 times.
    Address difference 64 occurred 1845 times.
    Address difference -32 occurred 1816 times.
    Address difference -64 occurred 1812 times.

s_new:
    19 different address differences occurred. Top 5:
    Address difference 32 occurred 991911 times.
    Address difference 96 occurred 7825 times.
    Address difference -524192 occurred 117 times.
    Address difference 0 occurred 90 times.
    Address difference 64 occurred 37 times.

代码:

from collections import Counter

for s in 's_yes', 's_not', 's_new':
    print(s + ':')
    ids = list(map(id, eval(s)))
    ctr = Counter(j - i for i, j in zip(ids, ids[1:]))
    print('   ', len(ctr), 'different address differences occurred. Top 5:')
    for delta, count in ctr.most_common(5):
        print(f'    Address difference {delta} occurred {count} times.')
    print()