Python 中的解释与动态调度惩罚

Question

我看了 Brandon Rhodes 关于 Cython 的演讲 - "The Day of the EXE Is Upon Us"。

Brandon 在 09:30 中提到，对于特定的一小段代码，跳过解释可实现 40% 的加速，而跳过分配和分派可实现 574% 的加速 (10:10)。

我的问题是 - 如何针对特定代码段衡量这一点？是否需要手动提取底层的 c 命令，然后以某种方式使运行time 运行它们？

这是一个非常有趣的观察结果，但我该如何重新创建实验？

Answer 1

我们来看看这个python函数：

def py_fun(i,N,step):
     res=0.0
     while i<N:
        res+=i
        i+=step
     return res

并使用ipython-魔法计时：

In [11]: %timeit py_fun(0.0,1.0e5,1.0)
10 loops, best of 3: 25.4 ms per loop

解释器将运行通过生成的字节码并解释它。但是，我们可以使用 cython for/cythonizing 完全相同的代码来删除解释器：

%load_ext Cython
%%cython
def cy_fun(i,N,step):
     res=0.0
     while i<N:
        res+=i
        i+=step
     return res

我们为此提高了 50% 的速度：

In [13]: %timeit cy_fun(0.0,1.0e5,1.0)
100 loops, best of 3: 10.9 ms per loop

当我们查看生成的 C 代码时，我们看到直接调用了正确的函数，而不需要 interpreted/calling ceval，在剥离样板代码之后：

static PyObject *__pyx_pf_4test_cy_fun(CYTHON_UNUSED PyObject *__pyx_self, PyObject *__pyx_v_i, PyObject *__pyx_v_N, PyObject *__pyx_v_step) {
  ...
  while (1) {
    __pyx_t_1 = PyObject_RichCompare(__pyx_v_i, __pyx_v_N, Py_LT); 
    ...
    __pyx_t_2 = __Pyx_PyObject_IsTrue(__pyx_t_1);
    ...
    if (!__pyx_t_2) break;
    ...
    __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_res, __pyx_v_i);
    ...
    __pyx_t_1 = PyNumber_InPlaceAdd(__pyx_v_i, __pyx_v_step); 
  }
  ...
  return __pyx_r;
}

然而，这个 cython 函数处理 python-objects 而不是 c-style floats，所以在函数 PyNumber_InPlaceAdd 中有必要弄清楚这些对象（整数，float，一些东西）否则？）真的是，并将此调用分派给可以完成这项工作的正确函数。

在 cython 的帮助下，我们还可以消除这种分派的需要，直接调用浮点数的乘法：

 %%cython
 def c_fun(double i,double N, double step):
      cdef double res=0.0
      while i<N:
         res+=i
         i+=step
      return res

在此版本中，i、N、step 和 res 是 c 风格的双精度数，不再是 python 对象。所以不再需要像 PyNumber_InPlaceAdd 这样调用 dispatch-functions 但我们可以直接调用 +-operator for double:

static PyObject *__pyx_pf_4test_c_fun(CYTHON_UNUSED PyObject *__pyx_self, double __pyx_v_i, double __pyx_v_N, double __pyx_v_step) {
  ...
  __pyx_v_res = 0.0;  
  ... 
  while (1) {
    __pyx_t_1 = ((__pyx_v_i < __pyx_v_N) != 0);
    if (!__pyx_t_1) break;
    __pyx_v_res = (__pyx_v_res + __pyx_v_i);
    __pyx_v_i = (__pyx_v_i + __pyx_v_step);
  }
  ...
  return __pyx_r;
}

结果是：

In [15]: %timeit c_fun(0.0,1.0e5,1.0)
10000 loops, best of 3: 148 µs per loop

现在，与没有解释器但有调度的版本相比，这是将近 100 的速度提升。

实际上，说 dispatch+allocation 是这里的瓶颈（因为消除它导致了几乎 100 倍的加速）是一个谬论：解释器负责超过 50% 运行时间（15 毫秒）和分派和分配 "only" 10 毫秒。

然而，在性能方面，比 "interpreter" 和动态调度存在更多问题：Float 是不可变的，因此每次它更改时都必须创建一个新对象，并 registered/unregistered 在垃圾收集器中。

我们可以引入可变浮点数，就地改变，不需要registering/unregistering:

%%cython
cdef class MutableFloat: 
 cdef double x      
 def __cinit__(self, x):
    self.x=x         
 def __iadd__(self, MutableFloat other):
    self.x=self.x+other.x
    return self
 def __lt__(MutableFloat self,  MutableFloat other):
    return self.x<other.x
 def __gt__(MutableFloat self, MutableFloat other):
    return self.x>other.x
 def __repr__(self):
    return str(self.x)

时间（现在我用的是不同的机器，所以时间有点不同）：

def py_fun(i,N,step,acc):
        while i<N:
             acc+=i
             i+=step
        return acc

%timeit py_fun(1.0, 5e5,1.0,0.0)
30.2 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each 
%timeit cy_fun(1.0, 5e5,1.0,0.0)
16.9 ms ± 612 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit i,N,step,acc=MutableFloat(1.0),MutableFloat(5e5),MutableFloat(1
    ...: .0),MutableFloat(0.0); py_fun(i,N,step,acc)
23 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit i,N,step,acc=MutableFloat(1.0),MutableFloat(5e5),MutableFloat(1
...: .0),MutableFloat(0.0); cy_fun(i,N,step,acc)
11 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

不要忘记重新初始化 i 因为它是可变的！结果

            immutable       mutable
 py_fun       30ms           23ms
 cy_fun       17ms           11ms

所以 registering/unregistering 浮点数需要最多 7 毫秒（大约 20%）（我不确定没有其他东西在起作用）在带有解释器的版本中，超过 33%没有解释器的版本。

现在的样子：

口译员使用了 40% (13/30) 的时间
最多 33% 的时间用于动态调度
最多 20% 的时间用于 creating/deleting 个临时对象
大约 1% 用于算术运算

另一个问题是数据的局部性，这对于内存带宽限制问题变得很明显：如果数据一个接一个地线性处理，则现代缓存工作良好。这对于遍历 std::vector<>（或 array.array）是正确的，但不适用于遍历 python 列表，因为该列表由可以指向内存中任何位置的指针组成。

考虑以下 python 脚本：

#list.py
N=int(1e7)
lst=[0]*int(N)
for i in range(N):
  lst[i]=i
print(sum(lst))

和

#byte
N=int(1e7)
b=bytearray(8*N)
m=memoryview(b).cast('L') #reinterpret as an array of unsigned longs
for i in range(N):
  m[i]=i
print(sum(m))

它们都创建 1e7 整数，第一个版本 Python-整数，第二个版本是连续放置在内存中的低级 c-int。

有趣的是，这些脚本产生了多少缓存未命中 (D)：

valgrind --tool=cachegrind python list.py 
...
D1  misses:        33,964,276  (   27,473,138 rd   +     6,491,138 wr)

对比

valgrind --tool=cachegrind python bytearray.py 
...
D1  misses:         4,796,626  (    2,140,357 rd   +     2,656,269 wr)

这意味着 python 整数的缓存未命中数增加了 8 倍。部分原因是 python 整数需要超过 8 个字节（可能是 32 个字节，即因子 4）内存和（也许不是 100% 确定，因为相邻的整数是在彼此之后创建的，所以机会很高，它们在内存中的某个地方依次存储，需要进一步调查）一些是因为它们在内存中没有对齐，因为 bytearray.[ 的 c 整数就是这种情况。 =39=]

Python 中的解释与动态调度惩罚

Interpretation vs dynamic dispatch penalty in Python

python

dynamic-programming

performance-testing

cython

dispatch