为什么 python lru_cache 在 maxsize 是二次方时表现最好？

Question

If maxsize is set to None, the LRU feature is disabled and the cache can grow without bound. The LRU feature performs best when maxsize is a power-of-two.

有谁知道这个“2 的幂”是从哪里来的？我猜这与实现有关。

Answer 1

TL;DR - 这是一项优化，在 lru_cache 尺寸较小时效果不佳，但（请参阅 Raymond 的回复）随着 lru_cache 尺寸变大，效果会更大。

所以这激起了我的兴趣，我决定看看这是不是真的。

首先我去阅读了 LRU 缓存的源代码。 cpython 的实现在这里：https://github.com/python/cpython/blob/master/Lib/functools.py#L723 并且我没有看到任何让我觉得基于 2 的幂会更好地运行的东西。

所以，我写了一个简短的 python 程序来制作各种大小的 LRU 缓存，然后多次运行这些缓存。这是代码：

from functools import lru_cache
from collections import defaultdict
from statistics import mean
import time

def run_test(i):
    # We create a new decorated perform_calc
    @lru_cache(maxsize=i)
    def perform_calc(input):
        return input * 3.1415

    # let's run the test 5 times (so that we exercise the caching)
    for j in range(5):
        # Calculate the value for a range larger than our largest cache
        for k in range(2000):
            perform_calc(k)

for t in range(10):
    print (t)
    values = defaultdict(list)
    for i in range(1,1025):
        start = time.perf_counter()
        run_test(i)
        t = time.perf_counter() - start
        values[i].append(t)

for k,v in values.items():
    print(f"{k}\t{mean(v)}")

我运行这在轻负载的 macbook pro 上 python 3.7.7.

结果如下：

https://docs.google.com/spreadsheets/d/1LqZHbpEL_l704w-PjZvjJ7nzDI1lx8k39GRdm3YGS6c/preview?usp=sharing

运行dom 峰值可能是由于 GC 暂停或系统中断造成的。

此时我意识到我的代码总是生成缓存未命中，而从不生成缓存命中。如果我们运行同样的事情，但总是命中缓存会怎样？

我将内部循环替换为：

    # let's run the test 5 times (so that we exercise the caching)
    for j in range(5):
        # Only ever create cache hits
        for k in range(i):
            perform_calc(k)

此数据位于与上面相同的电子表格中，第二个选项卡。

让我们看看：

嗯，但我们并不真正关心这些数字中的大部分。另外，我们没有为每个测试做相同数量的工作，所以时间似乎没有用。

如果我们运行它仅用于 2^n 2^n + 1 和 2^n - 1 会怎么样。因为这加快了速度，我们将对 100 次测试进行平均，而不是只有 10.

我们还将生成一个大型运行dom 列表到运行，因为这样我们会期望有一些缓存命中和缓存未命中。

from functools import lru_cache
from collections import defaultdict
from statistics import mean
import time
import random

rands = list(range(128)) + list(range(128)) + list(range(128)) + list(range(128)) + list(range(128)) + list(range(128)) + list(range(128)) + list(range(128))
random.shuffle(rands)

def run_test(i):
    # We create a new decorated perform_calc
    @lru_cache(maxsize=i)
    def perform_calc(input):
        return input * 3.1415

    # let's run the test 5 times (so that we exercise the caching)
    for j in range(5):
        for k in rands:
            perform_calc(k)

for t in range(100):
    print (t)
    values = defaultdict(list)
    # Interesting numbers, and how many random elements to generate
    for i in [15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129, 255, 256, 257, 511, 512, 513, 1023, 1024, 1025]:
        start = time.perf_counter()
        run_test(i)
        t = time.perf_counter() - start
        values[i].append(t)

for k,v in values.items():
    print(f"{k}\t{mean(v)}")

此数据位于上面电子表格的第三个选项卡中。

这是每个元素的平均时间/lru 缓存大小的图表：

当然，随着我们的缓存大小变大，时间会减少，因为我们不会花那么多时间执行计算。有趣的是，似乎确实有从 15 到 16、17 和 31 到 32、33 的下降。让我们放大更高的数字：

我们不仅在更高的数字中失去了这种模式，而且我们实际上看到性能下降对于某些 2 的幂（511 到 512、513）。

编辑：关于二次方的注释是 added in 2012, but the algorithm for functools.lru_cache looks the same at that commit，所以不幸的是，这推翻了我的理论，即算法已更改且文档已过时。

编辑：删除了我的假设。原作者在上面回复 - 我的代码的问题是我正在使用 "small" 缓存 - 这意味着 O(n) 调整字典的大小并不是很昂贵。用非常大的 lru_caches 和大量的缓存未命中进行试验，看看我们是否能让效果出现，这会很酷。

Answer 2

尺寸效应出现的地方

lru_cache() code 以一种非典型的方式使用它的基础字典。在保持总大小不变的同时，缓存未命中会删除最旧的项目并插入新项目。

二次幂建议是这种删除和插入模式如何与基础 dictionary implementation 交互的产物。

词典的工作原理

Table 大小是 2 的幂。
删除的键被替换为虚拟项。
新密钥有时可以重复使用 dummy 插槽，有时则不能。
使用不同的键重复删除和插入将用虚拟条目填充 table。
当 table 已满三分之二时，将运行 O(N) 调整大小操作。
由于 active 条目的数量保持不变，调整大小操作实际上不会更改 table 大小。
调整大小的唯一作用是清除累积的虚拟条目。

性能影响

具有 2**n 个条目的字典对虚拟个条目具有最可用的 space，因此 O(n) 调整大小发生的频率较低。

此外，sparse dictionaries have fewer hash collisions 比大多数完整的词典都好。碰撞会降低字典性能。

重要的时候

lru_cache() 仅在缓存未命中时才更新字典。此外，当出现未命中时，将调用包装函数。因此，只有在未命中率很高且包装函数非常便宜的情况下，调整大小的效果才会重要。

远比给 maxsize 一个二次方更重要的是使用最大合理的 maxsize。更大的缓存有更多的缓存命中 — 这就是大赢的来源。

模拟

一旦 lru_cache() 已满并且发生了第一次调整大小，字典就会进入稳定状态并且永远不会变大。在这里，我们模拟了接下来会发生什么，因为添加了新的虚拟条目并定期调整大小以清除它们。

steady_state_dict_size = 2 ** 7     # always a power of two

def simulate_lru_cache(lru_maxsize, events=1_000_000):
    'Count resize operations as dummy keys are added'
    resize_point = steady_state_dict_size * 2 // 3
    assert lru_maxsize < resize_point
    dummies = 0
    resizes = 0
    for i in range(events):
        dummies += 1
        filled = lru_maxsize + dummies
        if filled >= resize_point:
           dummies = 0
           resizes += 1
    work = resizes * lru_maxsize    # resizing is O(n)
    work_per_event = work / events
    print(lru_maxsize, '-->', resizes, work_per_event)

这是输出的摘录：

for maxsize in range(42, 85):
    simulate_lru_cache(maxsize)

42 --> 23255 0.97671
43 --> 23809 1.023787
44 --> 24390 1.07316
45 --> 25000 1.125
46 --> 25641 1.179486
  ...
80 --> 200000 16.0
81 --> 250000 20.25
82 --> 333333 27.333306
83 --> 500000 41.5
84 --> 1000000 84.0

这表明当 maxsize 尽可能远离 resize_point[=83= 时，缓存所做的工作明显减少].

历史

在 Python3.2 中影响很小，当调整大小时词典增加了 4 x active_entries。

增长率降低到2 x active entries时的效果became catastrophic。

稍后a compromise was reached，将增长率设置为3 x used。通过默认为我们提供更大的稳态大小，这显着缓解了这个问题。

2 的幂 maxsize 仍然是最佳设置，对于给定的稳态字典大小给出最少的工作，但它不再像以前那样重要在 Python3.2.

希望这有助于理清您的理解。 :-)

为什么 python lru_cache 在 maxsize 是二次方时表现最好？

Why does python lru_cache performs best when maxsize is a power-of-two?

caching

lru

python-3.x

尺寸效应出现的地方

词典的工作原理

性能影响

重要的时候

模拟

历史