对数组的预计算跨步访问模式会提供更差的性能？

Question

我为 numpy 库编写了一个 c-extension，用于计算特定类型的 bincount。由于没有更好的名字，我们就叫它 fast_compiled 并将方法签名放在 numpy/core/src/multiarray/multiarraymodule.c 里面 array_module_methods:

{"fast_compiled", (PyCFunction)arr_fast_compiled,
    METH_VARARGS | METH_KEYWORDS, NULL},

和numpy/core/src/multiarray/compiled_base.c（和compiled_base.h）里面的实际实现：

NPY_NO_EXPORT PyObject *
arr_fast_compiled(PyObject *NPY_UNUSED(self), PyObject *args, PyObject *kwds)
{
    PyObject *list_obj = NULL, *strides_obj = Py_None;
    PyArrayObject *list_arr = NULL, *ans = NULL, *strides_arr = NULL;

    npy_intp len, ans_size, total_size;
    npy_intp i, j, k;
    double *dans, *weights;
    npy_intp* strides;

    static char *kwlist[] = {"weights", "strides", NULL};

    if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|O", 
                kwlist, &list_obj, &strides_obj)) {
            goto fail;
    }

    list_arr = (PyArrayObject *)PyArray_ContiguousFromAny(list_obj, NPY_DOUBLE, 2, 2);
    if (list_arr == NULL) {
        goto fail;
    }

    len = PyArray_DIM(list_arr, 0);
    weights = (double *)PyArray_DATA(list_arr);
    ans_size = 2*len-1;

    ans = (PyArrayObject *)PyArray_ZEROS(1, &ans_size, NPY_DOUBLE, 0);
    if (ans == NULL) {
        goto fail;
    }
    dans = (double *)PyArray_DATA(ans);
    NPY_BEGIN_ALLOW_THREADS;
    if (strides_obj == Py_None) {
        for (i = 0; i < len; ++i) {
            k = i * len;
            for (j = i; j < i + len; ++j, ++k) {
                dans[j] += weights[k];
            }
        }
        Py_DECREF(list_arr);
    }
    else {
        total_size = len*len;
        strides_arr = (PyArrayObject *)PyArray_ContiguousFromAny(
                                                strides_obj, NPY_INTP, 1, 1);
        strides = (npy_intp *)PyArray_DATA(strides_arr);

        for (i = 0; i < total_size; ++i) {
            dans[strides[i]] += weights[i];
        }
        Py_DECREF(list_arr);
        Py_DECREF(strides_arr);
    }
    NPY_END_ALLOW_THREADS;
    return (PyObject *)ans;

fail:
    Py_XDECREF(list_arr);
    Py_XDECREF(strides_arr);
    Py_XDECREF(ans);
    return NULL;
}

该方法需要一个必需的位置参数 weights 和一个可选的关键字参数 strides。根据是否指定了 strides，它将使用不同的（等效的）方法来计算答案。

我很好奇为什么预先计算步幅并将其指定为关键字参数比在嵌套 for-loop 中计算步幅慢。 IE。这是为什么：

for (i = 0; i < total_size; ++i) {
    dans[strides[i]] += weights[i];
}

比这慢：

for (i = 0; i < len; ++i) {
    k = i * len;
    for (j = i; j < i + len; ++j, ++k) {
        dans[j] += weights[k];
    }
}

这是我计算基准的方法：

import numpy as np
import perfplot 

def fast_compiled(args):
    A, _ = args
    return np.fast_compiled(A)

def fast_compiled_strides(args):
    A, strides = args
    return np.fast_compiled(A, strides=strides)

def setup(n):
    A = np.random.normal(size=(n, n))
    strides = np.arange(n*n)
    strides = np.lib.stride_tricks.sliding_window_view(strides, (n,))
    strides = strides[:n]
    strides = strides.flatten() # make sure it is continous
    return A, strides

perfplot.show(
    setup=setup,
    kernels=[fast_compiled, fast_compiled_strides],
    n_range=[2 ** k for k in range(3, 15)],
    xlabel='n',
    relative_to=0,
)

Answer 1

fast_compiled 比 fast_compiled_strides 快，因为它处理编译时已知的连续数据，使编译器能够使用 SIMD 指令（例如，通常 SSE 在类 x86 平台或 ARM 平台上的 NEON）。它也应该更快，因为从 L1 缓存检索的数据缓存更少（由于间接需要更多的提取）。

确实，dans[j] += weights[k] 可以通过加载 dans 的 m 项和 weights 的 m 项添加 m 项来矢量化使用一条指令并将 m 项存储回 dans。此解决方案高效且缓存友好。

dans[strides[i]] += weights[i] 无法在大多数主流硬件上有效矢量化。由于间接性，处理器需要从内存层次结构中执行代价高昂的收集，然后进行求和，然后执行同样代价高昂的分散存储。即使 strides 包含连续索引，指令通常比从内存加载 连续数据块 昂贵得多。此外，编译器通常无法对代码进行矢量化，或者只是发现在这种情况下不值得使用 SIMD 指令。因此，生成的代码可能是效率较低的标量代码。

实际上，在具有良好编译标志的现代处理器上，两种代码之间的性能差异应该更大。我怀疑您在这里只在 x86 处理器上使用 SSE，因此理论上加速接近 2，因为可以连续计算 2 个双精度浮点数。但是，使用 AVX/AVX-2 理论上会导致速度接近 4（因为可以连续计算 4 个数字）。最近的 Intel 处理器甚至可以连续计算 8 个双精度浮点数。请注意，计算单精度浮点数也可以使理论上的速度提高 2 倍。这同样适用于其他架构，如带有 NEON 和 SVE 指令集的 ARM 或 POWER。由于未来的处理器可能会使用更广泛的 SIMD 寄存器（因为它们的效率），因此编写 SIMD 友好的代码非常重要。

Answer 2

这就是 fast_compiled 的 内部循环 的样子：

dans[j] += weights[k];

在汇编中：

│   0x7ffff6c35260 <arr_fast_compiled+336>  movsd  (%rbx,%rdx,8),%xmm0                                                                                                                                            │
│   0x7ffff6c35265 <arr_fast_compiled+341>  addsd  0x0(%rbp,%rdx,8),%xmm0                                                                                                                                         │
│   0x7ffff6c3526b <arr_fast_compiled+347>  movsd  %xmm0,(%rbx,%rdx,8)                                                                                                                                            │
│   0x7ffff6c35270 <arr_fast_compiled+352>  add    [=11=]x1,%rdx                                                                                                                                                      │
│   0x7ffff6c35274 <arr_fast_compiled+356>  cmp    %rcx,%rdx                                                                                                                                                      │
│   0x7ffff6c35277 <arr_fast_compiled+359>  jne    0x7ffff6c35260 <arr_fast_compiled+336>

前 3 条指令是来自 SSE2 指令集的 SIMD 指令。

fast_compiled_strides 的循环体如下所示：

dans[strides[i]] += weights[i];

在汇编中：

│   0x7ffff6c35330 <arr_fast_compiled+544>  mov    (%rsi,%rdx,8),%rcx                                                                                                                                     │
│   0x7ffff6c35334 <arr_fast_compiled+548>  lea    (%rbx,%rcx,8),%rcx                                                                                                                                     │
│   0x7ffff6c35338 <arr_fast_compiled+552>  movsd  (%rcx),%xmm0                                                                                                                                           │
│   0x7ffff6c3533c <arr_fast_compiled+556>  addsd  0x0(%rbp,%rdx,8),%xmm0                                                                                                                                 │
│   0x7ffff6c35342 <arr_fast_compiled+562>  add    [=13=]x1,%rdx                                                                                                                                              │
│   0x7ffff6c35346 <arr_fast_compiled+566>  movsd  %xmm0,(%rcx)                                                                                                                                           │
│   0x7ffff6c3534a <arr_fast_compiled+570>  cmp    %rdx,%r14                                                                                                                                              │
│   0x7ffff6c3534d <arr_fast_compiled+573>  jne    0x7ffff6c35330 <arr_fast_compiled+544>

看起来和fast_compiled很像，只是它需要在开始时加载strides。但更值得注意的是，它们似乎都使用相同的 SIMD 指令 movsd、addsd 和 movsd ?

我运行 numpy 和 gdb 的使用方式与 the numpy dev. setup guide 中描述的相同，因此我假设所有编译标志都已设置。

我正在使用英特尔 cpu 应该可以访问 SSE4.2 和 AVX2:

sudo lshw -c cpu

输出：

  *-cpu                     
       description: CPU
       product: Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
       vendor: Intel Corp.
       physical id: e
       bus info: cpu@0
       version: Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
       slot: CPUSocket
       size: 3727MHz
       capacity: 3900MHz
       width: 64 bits
       clock: 100MHz
       capabilities: lm fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp x86-64 constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d cpufreq
       configuration: cores=4 enabledcores=4 threads=4

对数组的预计算跨步访问模式会提供更差的性能？

Precomputing strided access pattern to array gives worse performance?

c

python

numpy

python-c-api