cupy.var (variance) performance much slower than numpy.var: trying to understand why

I am hoping to move my custom camera video pipeline to use video memory with a combination of numba and cupy, and to avoid passing data back to host memory if at all possible. As part of doing this I need to port my sharpness-detection routine to use cuda. The easiest way to do this seemed to be to use cupy, since all I am doing is computing the variance of a Laplace transform of each image. The trouble I am running into is that the cupy variance calculation appears to be about 8x slower than numpy. This includes the time it takes to copy the device ndarray to the host and perform the variance calculation on the CPU using numpy. I am hoping to gain a better understanding of why the variance-calculation ReductionKernel that cupy uses on the GPU is so slow. I'll start with the test I ran below.
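
For context, here is a minimal sketch of the sharpness metric I am trying to port (variance of a Laplacian). It assumes cupyx.scipy.ndimage provides a laplace filter in my CuPy build; the function name is illustrative only:

import cupy as cp
from cupyx.scipy import ndimage as cpx_ndimage  # assumed to provide a GPU Laplacian filter

def sharpness(image_gpu):
    # Variance of the Laplacian as a focus/sharpness score, computed entirely on the GPU
    lap = cpx_ndimage.laplace(image_gpu.astype(cp.float32))
    return float(cp.var(lap))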

import cupy as cp
import numpy as np
from numba import cuda
import cv2
from timeit import default_timer as timer

n_runs = 10
n_warmup = 2
n_tot = n_runs + n_warmup
sizes = (100, 500, 1000, 2000, 5000, 10000)

# NumPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        np.random.seed(0)
        x = np.random.randn(*(s,s)).astype(np.uint16)
        t_start = timer()
        _ = np.var(x)
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f'NumPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}')
    

# CuPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        np.random.seed(0)
        x = np.random.randn(*(s,s)).astype(np.uint16)
        x_nb = cuda.to_device(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        x_cp = cp.asarray(x_nb)
        _ = cp.var(x_cp)
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f'CuPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}')

The output of this script looks like this:

NumPy: 100x100 0.00006 +- 0.00000
NumPy: 500x500 0.00053 +- 0.00008
NumPy: 1000x1000 0.00243 +- 0.00029
NumPy: 2000x2000 0.01073 +- 0.00136
NumPy: 5000x5000 0.07470 +- 0.00264
NumPy: 10000x10000 0.28578 +- 0.00313
...
CuPy: 100x100 0.00026 +- 0.00002
CuPy: 500x500 0.00463 +- 0.00068
CuPy: 1000x1000 0.02308 +- 0.00060
CuPy: 2000x2000 0.07876 +- 0.00172
CuPy: 5000x5000 0.50830 +- 0.02237
CuPy: 10000x10000 1.98131 +- 0.03540

I also tried using float16 and float32 instead of uint16 (since it looks like the reduction kernel that cupy uses works with floats), but that did not change the difference in any way that made sense to me.
https://github.com/cupy/cupy/blob/04bf15706474a5e79ba196e70784a147cad6e26e/cupy/_core/_routines_statistics.pyx#L542

Here are some of the versions relevant to my working environment:

>>> numpy.__version__
'1.18.5'
>>> cupy.__version__
'9.6.0'
>>> numba.__version__
'0.53.1'
Python 3.6.9
Driver Version: 470.82.01 
CUDA Version: 11.4

Any hints as to what might be causing cupy to perform so poorly would be much appreciated. Also, if there is anything that can be done to improve the performance, I would love to know. I tried reading about ReductionKernels to understand how to optimize them, but I could not make sense of it. https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
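
For reference, this is roughly what the ReductionKernel API looks like. The kernel below is only an illustration of the interface (a sum of squared deviations given a precomputed mean); I have not verified that it is any faster than cp.var:

import cupy as cp

# Illustrative custom reduction: sum of squared deviations from a precomputed mean
sq_dev_sum = cp.ReductionKernel(
    'T x, T mu',            # input params: the data and its mean
    'T y',                  # output param: the reduced value
    '(x - mu) * (x - mu)',  # map expression, applied per element
    'a + b',                # reduce expression, combines two partial results
    'y = a',                # post-reduction expression
    '0',                    # identity value
    'sq_dev_sum')           # kernel name

def var_via_kernel(x):
    x = x.astype(cp.float32)           # the kernel expects a single dtype T for both arguments
    mu = x.mean(dtype=cp.float32)
    return sq_dev_sum(x, mu) / x.size  # population variance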

Updated results based on feedback from @myrtlecat:

with dtype np.float32
CuPy (1 axis at a time): 100x100 0.00017 +- 0.00002
CuPy (1 axis at a time): 500x500 0.00041 +- 0.00001
CuPy (1 axis at a time): 1000x1000 0.00097 +- 0.00003
CuPy (1 axis at a time): 2000x2000 0.00278 +- 0.00016
CuPy (1 axis at a time): 5000x5000 0.01381 +- 0.00041
CuPy (1 axis at a time): 10000x10000 0.04313 +- 0.00355
CuPy: 100x100 0.00013 +- 0.00000
CuPy: 500x500 0.00432 +- 0.00013
CuPy: 1000x1000 0.01713 +- 0.00070
CuPy: 2000x2000 0.06713 +- 0.00079
CuPy: 5000x5000 0.41975 +- 0.00259
CuPy: 10000x10000 1.69374 +- 0.01938
NumPy: 100x100 0.00004 +- 0.00000
NumPy: 500x500 0.00022 +- 0.00001
NumPy: 1000x1000 0.00121 +- 0.00018
NumPy: 2000x2000 0.00530 +- 0.00047
NumPy: 5000x5000 0.03432 +- 0.00179
NumPy: 10000x10000 0.14227 +- 0.00503

with dtype np.uint16
CuPy (1 axis at a time): 100x100 0.00018 +- 0.00000
CuPy (1 axis at a time): 500x500 0.00124 +- 0.00003
CuPy (1 axis at a time): 1000x1000 0.00430 +- 0.00020
CuPy (1 axis at a time): 2000x2000 0.01537 +- 0.00008
CuPy (1 axis at a time): 5000x5000 0.07413 +- 0.00330
CuPy (1 axis at a time): 10000x10000 0.29842 +- 0.01291
CuPy: 100x100 0.00016 +- 0.00000
CuPy: 500x500 0.00359 +- 0.00053
CuPy: 1000x1000 0.01952 +- 0.00058
CuPy: 2000x2000 0.07719 +- 0.00076
CuPy: 5000x5000 0.48284 +- 0.00169
CuPy: 10000x10000 1.96746 +- 0.04353
NumPy: 100x100 0.00006 +- 0.00002
NumPy: 500x500 0.00053 +- 0.00010
NumPy: 1000x1000 0.00224 +- 0.00016
NumPy: 2000x2000 0.00956 +- 0.00034
NumPy: 5000x5000 0.06818 +- 0.00210
NumPy: 10000x10000 0.27071 +- 0.00747

Updated script based on feedback from @myrtlecat:

import cupy as cp
import numpy as np
from timeit import default_timer as timer

n_runs = 10
n_warmup = 2
n_tot = n_runs + n_warmup
sizes = (100, 500, 1000, 2000, 5000, 10000)


dtype = np.uint16
# dtype = np.float32


def mean(x):
    while x.size > 1:
        x = x.mean(-1)
    return x


def var(x):
    return mean((x - mean(x)) ** 2)


# CuPy (1 axis at a time)
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        x_cp = cp.asarray(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        _ = var(x_cp)
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"CuPy (1 axis at a time): {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

# CuPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        x_cp = cp.asarray(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        _ = cp.var(x_cp)
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"CuPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

# NumPy
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        t_start = timer()
        _ = np.var(x)
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"NumPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

Final updated script and results based on feedback from @myrtlecat, who said the issue might be specific to 2-dimensional arrays, so I tried flattening the array with reshape before calling "var":

# CuPy (flattened)
for s in sizes:
    t_cp = np.zeros(n_tot)
    for n in range(n_tot):
        # np.random.seed(0)
        x = np.random.randn(*(s, s)).astype(dtype)
        x_cp = cp.asarray(x)
        cp.cuda.Stream.null.synchronize()
        t_start = timer()
        _ = var(x_cp.reshape((s * s,)))
        cp.cuda.Stream.null.synchronize()
        t_cp[n] = timer() - t_start
    t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
    print(f"CuPy (flattened): {s}x{s} {t_mean:.5f} +- {t_std:.5f}")

Results:

for uint16
CuPy (flattened): 100x100 0.00018 +- 0.00006
CuPy (flattened): 500x500 0.00107 +- 0.00002
CuPy (flattened): 1000x1000 0.00414 +- 0.00020
CuPy (flattened): 2000x2000 0.01550 +- 0.00036
CuPy (flattened): 5000x5000 0.10017 +- 0.00525
CuPy (flattened): 10000x10000 0.39470 +- 0.01606
CuPy (1 axis at a time): 100x100 0.00026 +- 0.00008
CuPy (1 axis at a time): 500x500 0.00104 +- 0.00005
CuPy (1 axis at a time): 1000x1000 0.00368 +- 0.00028
CuPy (1 axis at a time): 2000x2000 0.01364 +- 0.00055
CuPy (1 axis at a time): 5000x5000 0.07639 +- 0.00311
CuPy (1 axis at a time): 10000x10000 0.29405 +- 0.00419

for float32
CuPy (flattened): 100x100 0.00015 +- 0.00007
CuPy (flattened): 500x500 0.00043 +- 0.00001
CuPy (flattened): 1000x1000 0.00159 +- 0.00003
CuPy (flattened): 2000x2000 0.00631 +- 0.00030
CuPy (flattened): 5000x5000 0.03827 +- 0.00135
CuPy (flattened): 10000x10000 0.13173 +- 0.00579
CuPy (1 axis at a time): 100x100 0.00020 +- 0.00005
CuPy (1 axis at a time): 500x500 0.00030 +- 0.00004
CuPy (1 axis at a time): 1000x1000 0.00077 +- 0.00002
CuPy (1 axis at a time): 2000x2000 0.00215 +- 0.00002
CuPy (1 axis at a time): 5000x5000 0.01387 +- 0.00049
CuPy (1 axis at a time): 10000x10000 0.04099 +- 0.00142

So for both uint16 and float32 the results are faster with the suggested one-axis-at-a-time technique than with flattening the 2D array via reshape, although reshaping is still much faster than passing in the 2D array directly.

I have a partial hypothesis for the problem (not a complete explanation) and a workaround. Maybe someone can fill in the gaps. For brevity I have used a quicker-and-dirtier benchmark.

Workaround: reduce one axis at a time

Cupy is much faster when a reduction is performed over one axis at a time. Instead of:

x.sum()

prefer this:

x.sum(-1).sum(-1).sum(-1)...

Note that the results of these computations may differ slightly due to rounding error.

Here are faster mean and var functions:

def mean(x):
    while x.size > 1:
        x = x.mean(-1)
    return x

def var(x):
    return mean((x - mean(x)) ** 2)

Benchmark

import numpy as np
import cupy as cp
from timeit import timeit

def mean(x):
    while x.size > 1:
        x = x.mean(-1)
    return x

def var(x):
    return mean((x - mean(x)) ** 2)

def benchmark(label, f):
    t = timeit(f, number=10) / 10
    print(f"{label.ljust(25)}{t:0.4f}s")

x = np.random.rand(5000, 5000).astype('f4')
x_cp = cp.array(x)

benchmark("Numpy", lambda: x.var())
benchmark("Cupy", lambda: float(x_cp.var()))
benchmark("Cupy (1 axis at a time)", lambda: float(var(x_cp)))

which gives a speedup of over 100x:

Numpy                    0.0469s
Cupy                     0.3805s
Cupy (1 axis at a time)  0.0013s

The answers are close but not exactly the same because of rounding:

> x.var(), float(x_cp.var()), float(var(x_cp))
(0.08334004, 0.08333975821733475, 0.08334004133939743)

Why is it faster?

CUDA-capable GPUs are generally slower than modern CPUs when running a single thread, but they achieve high throughput by performing many identical operations in parallel.

Reductions such as sum and var can be parallelized by:

  • splitting the input into chunks
  • performing the reduction on each chunk in parallel
  • performing a second reduction over the per-chunk results

For sufficiently large input arrays, this will approach optimal performance, provided the chunk size is chosen well. (Note: for small arrays, or to squeeze out every last drop of performance, more advanced techniques are required, such as those in the slides the OP linked to.)
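
As a rough illustration only (this is not what cupy actually does internally), the chunked strategy can be written out with ordinary CuPy array operations; the chunk size below is an arbitrary choice:

import cupy as cp

def chunked_sum(x, chunk=4096):
    # Flatten and zero-pad to a multiple of the chunk size (zero padding does not change a sum),
    # then reduce each chunk in parallel and reduce the per-chunk results in a second pass.
    flat = x.ravel()
    pad = (-flat.size) % chunk
    if pad:
        flat = cp.concatenate([flat, cp.zeros(pad, dtype=flat.dtype)])
    partial = flat.reshape(-1, chunk).sum(axis=-1)  # first-stage reduction: one value per chunk
    return partial.sum()                            # second-stage reduction over the chunk results

This is essentially what the one-axis-at-a-time trick exploits: reducing over the last axis of a 2D view is already a well-parallelized chunked reduction.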

I think it's a bug...

I think cupy should be applying this technique to all reductions (and my reading of the cupy code is that it does). However, for whatever reason, it does a good job of parallelizing reductions over a single axis, but not reductions over an entire array. I doubt this is intended behaviour, but I am not a cupy developer, so perhaps it is. The generated CUDA code and the parameters used to invoke the reduction kernels are quite buried, so I am not sure exactly what is happening in each case.

Updated benchmarks

Results from running the updated benchmark script with uint16:

CuPy (1 axis at a time): 100x100 0.00016 +- 0.00001
CuPy (1 axis at a time): 500x500 0.00029 +- 0.00000
CuPy (1 axis at a time): 1000x1000 0.00070 +- 0.00000
CuPy (1 axis at a time): 2000x2000 0.00242 +- 0.00001
CuPy (1 axis at a time): 5000x5000 0.01410 +- 0.00001
CuPy (1 axis at a time): 10000x10000 0.05145 +- 0.00149
CuPy: 100x100 0.00016 +- 0.00000
CuPy: 500x500 0.00316 +- 0.00000
CuPy: 1000x1000 0.01250 +- 0.00001
CuPy: 2000x2000 0.07283 +- 0.00290
CuPy: 5000x5000 0.44025 +- 0.00012
CuPy: 10000x10000 1.76455 +- 0.00190
NumPy: 100x100 0.00004 +- 0.00000
NumPy: 500x500 0.00056 +- 0.00001
NumPy: 1000x1000 0.00201 +- 0.00001
NumPy: 2000x2000 0.01066 +- 0.00005
NumPy: 5000x5000 0.08828 +- 0.00007
NumPy: 10000x10000 0.35403 +- 0.00064

So about a 7x speedup over numpy. And for float32:

CuPy (1 axis at a time): 100x100 0.00016 +- 0.00001
CuPy (1 axis at a time): 500x500 0.00018 +- 0.00001
CuPy (1 axis at a time): 1000x1000 0.00021 +- 0.00000
CuPy (1 axis at a time): 2000x2000 0.00052 +- 0.00000
CuPy (1 axis at a time): 5000x5000 0.00232 +- 0.00001
CuPy (1 axis at a time): 10000x10000 0.00837 +- 0.00002
CuPy: 100x100 0.00015 +- 0.00000
CuPy: 500x500 0.00281 +- 0.00001
CuPy: 1000x1000 0.01657 +- 0.00026
CuPy: 2000x2000 0.06557 +- 0.00003
CuPy: 5000x5000 0.37905 +- 0.00007
CuPy: 10000x10000 1.51899 +- 0.01084
NumPy: 100x100 0.00003 +- 0.00000
NumPy: 500x500 0.00032 +- 0.00000
NumPy: 1000x1000 0.00115 +- 0.00001
NumPy: 2000x2000 0.00581 +- 0.00010
NumPy: 5000x5000 0.04730 +- 0.00009
NumPy: 10000x10000 0.19188 +- 0.00024

That is about a 40x speedup over numpy.

Results will depend on the platform: for comparison, my CPU is an Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz and my GPU is an NVIDIA GeForce GTX 1080 Ti.