Numba and guvectorize for CUDA target: Code running slower than expected

Question

Noteworthy details

Large datasets (10 million x 5), (200 x 10 million x 5)
Mostly NumPy
On every run
Using Spyder3
Windows 10

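The first thing I did was try to use guvectorize with the following function. I passed in a bunch of numpy arrays and tried to use them to multiply across two of the arrays. This works when run with a target other than cuda. However, when switched to cuda, it results in an unknown error:

File "C:\ProgramData\Anaconda3\lib\site-packages\numba\cuda\decorators.py", line 82, in jitwrapper debug=debug)

TypeError: __init__() got an unexpected keyword argument 'debug'

After following up on everything I could find about this error, I hit nothing but dead ends. I'm guessing it's a really simple fix that I'm completely missing, but oh well. It should also be said that this error only occurs after running the code once and having it crash due to memory overload.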

os.environ["NUMBA_ENABLE_CUDASIM"] = "1"

os.environ["CUDA_VISIBLE_DEVICES"] = "10DE 1B06 63933842"
...

All of the arrays are numpy arrays.

@guvectorize(['void(int64, float64[:,:], float64[:,:], float64[:,:,:], '
              'int64, int64, float64[:,:,:])'],
             '(),(m,o),(m,o),(n,m,o),(),() -> (n,m,o)',
             target='cuda', nopython=True)
def cVestDiscount (ed, orCV, vals, discount, n, rowCount, cv):
    for as_of_date in range(0,ed):
        for ID in range(0,rowCount):
            for num in range(0,n):
                cv[as_of_date][ID][num] = orCV[ID][num] * discount[as_of_date][ID][num]

Trying to run the code with nvprof on the command line produces the following warning:

Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory

I realized that I am using SLI-enabled graphics cards (both cards are identical, EVGA GTX 1080 Ti, and have the same device ID), so I disabled SLI and added the "CUDA_VISIBLE_DEVICES" line to try to limit execution to the other card, but I get the same results.

I can still run the code with nvprof, but the cuda function is slow compared with njit(parallel=True) and prange. With a smaller data size we can run the code, but it is slower than target='parallel' and target='cpu'.

Why is cuda so much slower, and what do these errors mean?

Thanks for any help!

EDIT: Here is a working example of the code:

import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, countCol, cv):
    for as_of_date in range(0,countRow):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[as_of_date][ID][num] = multBy[ID][num] * discount[as_of_date][ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))

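I am able to run the code on cuda with a GTX 1080 Ti; however, it is much slower than running it in parallel or on the cpu. I've looked at other posts about guvectorize, but none of them helped me understand what is and isn't a good fit for guvectorize. Is there any way to make this code 'cuda friendly', or is multiplying across arrays simply too trivial to see any benefit?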

Answer 1

The reason the gufunc that Numba emits and runs is so slow becomes obvious on profiling (numba 0.38.1 with CUDA 8.0):

==24691== Profiling application: python slowvec.py
==24691== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
271.33ms  1.2800us                    -               -         -         -         -        8B  5.9605MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
271.65ms  14.591us                    -               -         -         -         -  156.25KB  10.213GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
272.09ms  2.5868ms                    -               -         -         -         -  15.259MB  5.7605GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
274.98ms     992ns                    -               -         -         -         -        8B  7.6909MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
275.17ms     640ns                    -               -         -         -         -        8B  11.921MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
276.33ms  657.28ms              (1 1 1)        (64 1 1)        40        0B        0B         -           -  GeForce GTX 970         1         7  cudapy::__main__::__gufunc_cVestDiscount2(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [38]
933.62ms  3.5128ms                    -               -         -         -         -  15.259MB  4.2419GB/s  GeForce GTX 970         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

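The code being run launches the kernel with a single block of 64 threads. On a GPU that can theoretically hold up to 2048 threads per MP and has 23 MPs, that means about 99.9% of the theoretical processing capacity of your GPU is going unused. This looks like a ridiculous design choice by the numba developers, and I would report it as a bug if you are hindered by it (and it seems you are).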
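
As a rough check of that figure, the arithmetic can be written out as follows. The 2048 threads per MP and the 23 MPs are simply the numbers quoted above, not values queried from a device, so treat this as a back-of-the-envelope sketch.

```python
# Back-of-the-envelope occupancy check using the figures quoted in this answer.
threads_launched = 1 * 64      # grid (1 1 1) x block (64 1 1) from the profile
max_threads_per_mp = 2048      # theoretical limit quoted above (assumed)
mp_count = 23                  # MP count quoted above (assumed)

capacity = max_threads_per_mp * mp_count
used_pct = 100.0 * threads_launched / capacity
idle_pct = 100.0 - used_pct

print(f"resident thread capacity: {capacity}")
print(f"used: {used_pct:.2f}%  idle: {idle_pct:.2f}%")
```

Under these assumptions the launch occupies well under 1% of the thread slots the device could keep resident.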

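The obvious solution is to rewrite your function as a @cuda.jit function in the CUDA Python kernel dialect, with explicit control over the execution parameters. That way you can at least ensure that the code runs with enough threads to potentially use all the capacity of your hardware. It is still a very memory-bound operation, so the speed-up you can achieve may be limited to well below the ratio of your GPU's memory bandwidth to that of your CPU. That may well not be enough to amortize the cost of the host-to-device memory transfers, so there may be no performance gain even in the best possible case, and this is far from the best case.

In short, beware the perils of automagic compiler-generated parallelism....

Postscript to add: I managed to work out how to get the PTX of the code emitted by numba, and suffice it to say it is absolutely craptulacular (and so long that I can't actually post all of it):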

{
    .reg .pred  %p<9>;
    .reg .b32   %r<8>;
    .reg .f64   %fd<4>;
    .reg .b64   %rd<137>;


    ld.param.u64    %rd29, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_5];
    ld.param.u64    %rd31, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_11];
    ld.param.u64    %rd32, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
    ld.param.u64    %rd34, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_14];
    ld.param.u64    %rd35, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_15];
    ld.param.u64    %rd36, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_16];
    ld.param.u64    %rd37, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_17];
    ld.param.u64    %rd38, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_22];
    ld.param.u64    %rd39, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_23];
    ld.param.u64    %rd40, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_24];
    ld.param.u64    %rd41, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_25];
    ld.param.u64    %rd42, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_26];
    ld.param.u64    %rd43, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_27];
    ld.param.u64    %rd44, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_28];
    ld.param.u64    %rd45, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_29];
    ld.param.u64    %rd46, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_30];
    ld.param.u64    %rd48, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_36];
    ld.param.u64    %rd51, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_43];
    ld.param.u64    %rd53, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_49];
    ld.param.u64    %rd54, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_50];
    ld.param.u64    %rd55, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_51];
    ld.param.u64    %rd56, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_52];
    ld.param.u64    %rd57, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_53];
    ld.param.u64    %rd58, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_54];
    ld.param.u64    %rd59, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_55];
    ld.param.u64    %rd60, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_56];
    ld.param.u64    %rd61, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_57];
    mov.u32     %r1, %tid.x;
    mov.u32     %r3, %ctaid.x;
    mov.u32     %r2, %ntid.x;
    mad.lo.s32  %r4, %r3, %r2, %r1;
    min.s64     %rd62, %rd32, %rd29;
    min.s64     %rd63, %rd39, %rd62;
    min.s64     %rd64, %rd48, %rd63;
    min.s64     %rd65, %rd51, %rd64;
    min.s64     %rd66, %rd54, %rd65;
    cvt.s64.s32 %rd1, %r4;
    setp.le.s64 %p2, %rd66, %rd1;
    @%p2 bra    BB0_8;

    ld.param.u64    %rd126, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_42];
    ld.param.u64    %rd125, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_44];
    ld.param.u64    %rd124, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_35];
    ld.param.u64    %rd123, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_37];
    ld.param.u64    %rd122, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_4];
    ld.param.u64    %rd121, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_6];
    cvt.u32.u64 %r5, %rd1;
    setp.lt.s32 %p1, %r5, 0;
    selp.b64    %rd67, %rd29, 0, %p1;
    add.s64     %rd68, %rd67, %rd1;
    mul.lo.s64  %rd69, %rd68, %rd121;
    add.s64     %rd70, %rd69, %rd122;
    selp.b64    %rd71, %rd48, 0, %p1;
    add.s64     %rd72, %rd71, %rd1;
    mul.lo.s64  %rd73, %rd72, %rd123;
    add.s64     %rd74, %rd73, %rd124;
    ld.u64  %rd2, [%rd74];
    selp.b64    %rd75, %rd51, 0, %p1;
    add.s64     %rd76, %rd75, %rd1;
    mul.lo.s64  %rd77, %rd76, %rd125;
    add.s64     %rd78, %rd77, %rd126;
    ld.u64  %rd3, [%rd78];
    ld.u64  %rd4, [%rd70];
    setp.lt.s64 %p3, %rd4, 1;
    @%p3 bra    BB0_8;

    ld.param.u64    %rd128, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_13];
    ld.param.u64    %rd127, [_ZN6cudapy8__main__26__gufunc_cVestDiscount2E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
    selp.b64    %rd80, %rd127, 0, %p1;
    mov.u64     %rd79, 0;
    min.s64     %rd81, %rd128, %rd79;
    min.s64     %rd82, %rd34, %rd79;
    selp.b64    %rd83, %rd39, 0, %p1;
    min.s64     %rd84, %rd40, %rd79;
    min.s64     %rd85, %rd41, %rd79;
    min.s64     %rd86, %rd42, %rd79;
    selp.b64    %rd87, %rd54, 0, %p1;
    min.s64     %rd88, %rd55, %rd79;
    min.s64     %rd89, %rd56, %rd79;
    min.s64     %rd90, %rd57, %rd79;
    mul.lo.s64  %rd91, %rd90, %rd61;
    add.s64     %rd92, %rd53, %rd91;
    mul.lo.s64  %rd93, %rd89, %rd60;
    add.s64     %rd94, %rd92, %rd93;
    mul.lo.s64  %rd95, %rd88, %rd59;
    add.s64     %rd96, %rd94, %rd95;
    add.s64     %rd98, %rd87, %rd1;
    mul.lo.s64  %rd99, %rd58, %rd98;
    add.s64     %rd5, %rd96, %rd99;
    mul.lo.s64  %rd100, %rd86, %rd46;
    add.s64     %rd101, %rd38, %rd100;
    mul.lo.s64  %rd102, %rd85, %rd45;
    add.s64     %rd103, %rd101, %rd102;
    mul.lo.s64  %rd104, %rd84, %rd44;
    add.s64     %rd105, %rd103, %rd104;
    add.s64     %rd106, %rd83, %rd1;
    mul.lo.s64  %rd107, %rd43, %rd106;
    add.s64     %rd6, %rd105, %rd107;
    mul.lo.s64  %rd108, %rd82, %rd37;
    add.s64     %rd109, %rd31, %rd108;
    mul.lo.s64  %rd110, %rd81, %rd36;
    add.s64     %rd111, %rd109, %rd110;
    add.s64     %rd112, %rd80, %rd1;
    mul.lo.s64  %rd113, %rd35, %rd112;
    add.s64     %rd7, %rd111, %rd113;
    add.s64     %rd8, %rd2, 1;
    mov.u64     %rd131, %rd79;

BB0_3:
    mul.lo.s64  %rd115, %rd59, %rd131;
    add.s64     %rd10, %rd5, %rd115;
    mul.lo.s64  %rd116, %rd44, %rd131;
    add.s64     %rd11, %rd6, %rd116;
    setp.lt.s64 %p4, %rd3, 1;
    mov.u64     %rd130, %rd79;
    mov.u64     %rd132, %rd3;
    @%p4 bra    BB0_7;

BB0_4:
    mov.u64     %rd13, %rd132;
    mov.u64     %rd12, %rd130;
    mul.lo.s64  %rd117, %rd60, %rd12;
    add.s64     %rd136, %rd10, %rd117;
    mul.lo.s64  %rd118, %rd45, %rd12;
    add.s64     %rd135, %rd11, %rd118;
    mul.lo.s64  %rd119, %rd36, %rd12;
    add.s64     %rd134, %rd7, %rd119;
    setp.lt.s64 %p5, %rd2, 1;
    mov.u64     %rd133, %rd8;
    @%p5 bra    BB0_6;

BB0_5:
    mov.u64     %rd17, %rd133;
    ld.f64  %fd1, [%rd135];
    ld.f64  %fd2, [%rd134];
    mul.f64     %fd3, %fd2, %fd1;
    st.f64  [%rd136], %fd3;
    add.s64     %rd136, %rd136, %rd61;
    add.s64     %rd135, %rd135, %rd46;
    add.s64     %rd134, %rd134, %rd37;
    add.s64     %rd24, %rd17, -1;
    setp.gt.s64 %p6, %rd24, 1;
    mov.u64     %rd133, %rd24;
    @%p6 bra    BB0_5;

BB0_6:
    add.s64     %rd25, %rd13, -1;
    add.s64     %rd26, %rd12, 1;
    setp.gt.s64 %p7, %rd13, 1;
    mov.u64     %rd130, %rd26;
    mov.u64     %rd132, %rd25;
    @%p7 bra    BB0_4;

BB0_7:
    sub.s64     %rd120, %rd4, %rd131;
    add.s64     %rd131, %rd131, 1;
    setp.gt.s64 %p8, %rd120, 1;
    @%p8 bra    BB0_3;

BB0_8:
    ret;
}

All of those integer operations to perform exactly one double-precision multiply!

Answer 2

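First of all, the basic operation you have shown is to take two matrices, transfer them to the GPU, do some elementwise multiplications to produce a third array, and pass that third array back to the host.

It may be possible to make a numba/cuda guvectorize (or cuda.jit kernel) implementation that runs faster than a naive serial Python implementation, but I doubt it would be possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) doing the same thing. This is because the arithmetic intensity per byte transferred between host and device is just too low. This operation is far too simple.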
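
To put a number on "too low", here is a rough estimate for the working example in the question. It assumes float64 data (8 bytes per element), counts one multiply per output element, and ignores the handful of 8-byte scalar arguments.

```python
# Rough arithmetic-intensity estimate for the example in the question.
# Shapes come from the question; float64 is 8 bytes per element.
BYTES_PER_FLOAT64 = 8

mult_by_elems = 4000 * 5
discount_elems = 100 * 4000 * 5
cv_elems = 100 * 4000 * 5

bytes_to_device = (mult_by_elems + discount_elems) * BYTES_PER_FLOAT64
bytes_to_host = cv_elems * BYTES_PER_FLOAT64
multiplies = cv_elems          # one multiply per output element

intensity = multiplies / (bytes_to_device + bytes_to_host)
print(f"bytes moved over the bus: {bytes_to_device + bytes_to_host}")
print(f"multiplies per byte transferred: {intensity:.4f}")
```

That is roughly one multiply for every 16 bytes moved between host and device, so the transfers dominate whatever the kernel does.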

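Secondly, I think it's important to start with an understanding of what numba vectorize and guvectorize are intended to do. The basic principle is to write the ufunc definition from the standpoint of "what will a worker do?" and then allow numba to spin up multiple workers from that. The way you instruct numba to spin up multiple workers is to pass a data set that is larger than the signatures you have given. It should be noted that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets its parallel "strength" by taking your ufunc definition and running it among parallel workers, where each worker handles a "slice" of the data but runs your entire ufunc definition on that slice. As some additional reading, I've covered some of this ground elsewhere as well.

So one problem with your implementation is that you have written a signature (and ufunc) that maps the entire input data set to a single worker. As @talonmies showed, your underlying kernel is being spun up with a total of 64 threads/workers (which is far too small to be of interest on a GPU, even apart from the above statements about arithmetic intensity), but I suspect that 64 is in fact just a numba minimum threadblock size, and that only 1 thread in that threadblock is doing any useful calculation work. That one thread executes your entire ufunc, including all the for-loops, in serial fashion.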
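
One way to see why only a single worker results is to count the dimensions that the signature leaves over. The helper below is a simplified model of generalized-ufunc broadcasting, not Numba's actual launch code; `worker_count` and its arguments are hypothetical names introduced for illustration.

```python
from math import prod


def worker_count(input_shapes, core_ndims):
    """Number of ufunc calls ("workers") implied by the input shapes.

    core_ndims[i] is how many trailing dimensions of input i the
    signature consumes; the remaining leading dimensions are looped over.
    """
    loop_shapes = []
    for shape, k in zip(input_shapes, core_ndims):
        if k > len(shape):
            raise ValueError("input has fewer dimensions than its signature")
        loop_shapes.append(tuple(shape[:len(shape) - k]))

    # Broadcast the leftover (loop) dimensions against each other.
    ndim = max(len(s) for s in loop_shapes)
    result = [1] * ndim
    for s in loop_shapes:
        padded = (1,) * (ndim - len(s)) + s
        for axis, size in enumerate(padded):
            if size != 1:
                if result[axis] not in (1, size):
                    raise ValueError("loop dimensions do not broadcast")
                result[axis] = size
    return prod(result)


# Signature '(),(m,o),(n,m,o),(),()' from the question: the core
# dimensions swallow every dimension of every input.
print(worker_count([(), (4000, 5), (100, 4000, 5), (), ()], [0, 2, 3, 0, 0]))  # 1

# A hypothetical signature '(m,o),(m,o),(),()' leaves the leading
# dimension of the 3-D input free, so it becomes the loop dimension.
print(worker_count([(4000, 5), (100, 4000, 5), (), ()], [2, 2, 0, 0]))  # 100
```

With the signature from the question nothing is left over to loop across, which is consistent with a single thread doing all of the work.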

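That's obviously not what anyone would intend for rational use of vectorize or guvectorize.

So let's revisit what you are trying to do. Ultimately your ufunc wants to multiply an input value from one array by an input value from another array and store the result in a third array. We want to repeat that process many times. If all 3 arrays were the same size, we could actually do this with vectorize and would not even have to resort to the more complicated guvectorize. Let's compare that approach to your original, focusing on the CUDA kernel execution. Here's a worked example, where t14.py is your original code, run with the profiler, and t15.py is a vectorize version of it, acknowledging that we have changed the size of your multBy array to match cv and discount: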

$ nvprof --print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
312.36ms  1.2160us                    -               -         -         -         -        8B  6.2742MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.81ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.52ms  5.8696ms                    -               -         -         -         -  15.259MB  2.5387GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.74ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.93ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
321.40ms  1.22538s              (1 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount2(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s  7.1816ms                    -               -         -         -         -  15.259MB  2.0749GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
193.92ms  6.2729ms                    -               -         -         -         -  15.259MB  2.3755GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
201.09ms  5.7101ms                    -               -         -         -         -  15.259MB  2.6096GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
364.92ms  842.49us          (15625 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__vectorized_cVestDiscount2(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms  7.1528ms                    -               -         -         -         -  15.259MB  2.0833GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

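We see that your application reported a run time of about 1.244 seconds, whereas the vectorize version reports about 0.375 seconds. But there is Python overhead in both of these numbers. If we look at the generated CUDA kernel durations in the profiler, the difference is even more stark. The original kernel took about 1.225 seconds, whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the compute kernel time is now much, much smaller than the time it takes to transfer the 3 arrays to/from the GPU (about 20 milliseconds in total), and that the kernel dimensions are now 15625 blocks of 128 threads each, for a total thread/worker count of 2000000. That exactly matches the total number of multiply operations to be done, and is substantially more than the paltry 64 threads (and possibly really only 1 thread) in action with your original code.

Given the simplicity of the above vectorize approach, if what you really want to do is this element-wise multiplication, then you might consider just replicating multBy so that it matches the other two arrays dimensionally, and be done with it.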
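
A minimal sketch of that replication in NumPy, assuming the shapes from the question. The names `mult_by` and `mult_by_full` are introduced here for illustration only.

```python
import numpy as np

mult_by = np.arange(20000, dtype=np.float64).reshape(4000, 5)
discount = np.arange(2000000, dtype=np.float64).reshape(100, 4000, 5)

# broadcast_to yields a read-only view with a zero stride on the new axis;
# ascontiguousarray then materializes a real (100, 4000, 5) copy.
mult_by_full = np.ascontiguousarray(np.broadcast_to(mult_by, discount.shape))

print(mult_by_full.shape)
print(mult_by_full.nbytes // mult_by.nbytes)
```

The trade-off is memory and transfer volume: the replicated array is 100 times the size of the original multBy, so it costs about as much to copy to the device as discount does.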

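But the question remains: how do we handle dissimilar input array sizes, as in the original problem? For that I think we need to go to guvectorize (or, as @talonmies indicated, write your own @cuda.jit kernel, which is probably the best advice, even though none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).

In order to tackle this with guvectorize, we need to think more carefully about the "slicing" concept already mentioned. Let's rewrite your guvectorize kernel so that it only operates on a "slice" of the overall data, and then allow the guvectorize launch function to spin up multiple workers to tackle it, one worker per slice.

In CUDA, we like to have lots of workers; you really can't have too many. So this affects how we "slice" our arrays, so as to give multiple workers the possibility to act. If we were to slice along the 3rd (last, n) dimension, we would only have 5 slices to work with, so a maximum of 5 workers. Likewise, if we slice along the first, or countRow, dimension, we would have 100 slices, so a maximum of 100 workers. Ideally, we would slice along the 2nd, or countCol, dimension. However, for simplicity, I will slice along the first, or countRow, dimension. This is clearly non-optimal, but see below for a worked example of how to approach slicing along the second dimension. Slicing along the first dimension means we remove the first for-loop from our guvectorize kernel and allow the ufunc system to parallelize along that dimension (based on the sizes of the arrays we pass). The code could look like this: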

$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t16.py
==4275== NVPROF is profiling process 4275, command: python t16.py
Function: discount factor cumVest duration (seconds):0.0670170783997
==4275== Profiling application: python t16.py
==4275== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
307.05ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
307.79ms  5.9293ms                    -               -         -         -         -  15.259MB  2.5131GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.34ms  1.3440us                    -               -         -         -         -        8B  5.6766MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.54ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
317.27ms  47.398ms              (2 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount2(Array<double, int=3, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
364.67ms  7.3799ms                    -               -         -         -         -  15.259MB  2.0192GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

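Observations:

The code changes amount to removing the countRow parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also changed the 3-dimensional arrays in the signature to 2-dimensional ones. We are taking a two-dimensional "slice" of the 3-dimensional data, after all, and letting each worker work on one slice.

The kernel dimensions reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization there was really only 1 "slice" presented, and therefore 1 worker needed, and therefore 1 thread (but numba spun up 1 threadblock of 64 threads). In this realization there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads to provide the needed 100 workers/threads.

The kernel duration reported by the profiler, 47.4 ms, is now somewhere between the original (~1.225 s) and the massively parallel vectorize version (~0.001 s). So going from 1 to 100 workers has sped things up considerably, but there are more performance gains possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the vectorize version, performance-wise (see below). Note that the difference between where we are here (~47 ms) and the vectorize version (~1 ms) is more than enough to make up for the additional cost (~5 ms, or less) of transferring a slightly larger multBy matrix to the device, to facilitate the vectorize simplicity.

Some additional comments on the Python timing: I believe the exact behavior of how Python compiles the necessary kernels for the original, vectorize, and improved guvectorize versions is different. If we modify the t15.py code to run a "warm-up" run, then at least the Python timing is consistent, trend-wise, with the overall wall time and the kernel-only timing: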

$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839

real    0m2.522s
user    0m1.572s
sys     0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091

real    0m1.050s
user    0m0.473s
sys     0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283

real    0m1.252s
user    0m0.680s
sys     0m0.441s
$
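
The warm-up pattern can be wrapped in a small helper so that every variant is timed the same way. This is a sketch; `time_after_warmup` is a hypothetical name, and the built-in `sum` stands in for the compiled function so the snippet runs without a GPU.

```python
from timeit import default_timer as timer


def time_after_warmup(func, *args, repeats=3):
    """Call func once untimed, then return the best of `repeats` timed calls."""
    func(*args)                  # warm-up: pays any one-off compilation cost
    best = float("inf")
    for _ in range(repeats):
        start = timer()
        func(*args)
        best = min(best, timer() - start)
    return best


# Stand-in workload; with numba this would be the decorated function
# and its array arguments instead.
print(time_after_warmup(sum, range(1000)))
```

Taking the best of a few repeats is a design choice that reduces noise from other processes; reporting the mean would be equally reasonable.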

Now, responding to a question in the comments, effectively: "How would I recast the problem to slice along the 4000 (countCol, or "middle") dimension?"

We can be guided by what worked for slicing along the first dimension. One possible approach is to rearrange the shape of the arrays so that the 4000 dimension is the first dimension, then remove that, similar to what we did in the previous treatment of guvectorize. Here's a worked example:

$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countCol, multBy, discount, n, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[num] * discount[ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
304.92ms  1.1840us                    -               -         -         -         -        8B  6.4437MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
305.36ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
306.08ms  6.0208ms                    -               -         -         -         -  15.259MB  2.4749GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.44ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.59ms  8.9961ms             (63 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount2(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms  7.2772ms                    -               -         -         -         -  15.259MB  2.0476GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
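
One caveat when applying this to real data: t17.py builds its (4000,100,5) test array with reshape, which is fine for timing but is not the same as swapping the axes of an existing (100,4000,5) array. The sketch below, in plain NumPy with a broadcast multiply standing in for the gufunc, shows the rearrangement done with a transpose and then undone on the result. The variable names are illustrative assumptions.

```python
import numpy as np

mult_by = np.arange(20000, dtype=np.float64).reshape(4000, 5)
discount = np.arange(2000000, dtype=np.float64).reshape(100, 4000, 5)

# Move the 4000 axis to the front: (100, 4000, 5) -> (4000, 100, 5).
discount_t = np.ascontiguousarray(discount.transpose(1, 0, 2))

# Stand-in for the gufunc: row i of mult_by, shape (5,), is applied to
# slice i of discount_t, shape (100, 5).
cv_t = mult_by[:, np.newaxis, :] * discount_t

# Undo the rearrangement to recover the original (100, 4000, 5) layout.
cv = cv_t.transpose(1, 0, 2)
print(np.array_equal(cv, mult_by * discount))
```

The extra copy made by ascontiguousarray happens on the host and adds to the overall run time, so it belongs in any end-to-end timing comparison.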

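Somewhat predictably, we observe that the execution time has dropped from ~47 ms when we sliced into 100 workers to ~9 ms when we slice into 4000 workers. Similarly, we observe that numba chooses to spin up 63 blocks of 64 threads each, for a total of 4032 threads, to handle the 4000 workers needed for this "slicing".

That is still not as fast as the ~1 ms vectorize kernel (which has many more parallel "slices" available for workers), but it is quite a bit faster than the ~1.2 s kernel proposed in the original question. And the overall walltime of the Python code is about 2x faster, even with all the Python overhead.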
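
The grid sizes seen in the four profiles are all consistent with a plain ceiling division of the worker count by the block size. The block sizes (64 and 128) are read off the profiler output; that numba computes the grid this way is an assumption that merely fits the observed numbers.

```python
def blocks_for(workers, threads_per_block):
    """Smallest block count whose threads cover all workers (ceiling division)."""
    return (workers + threads_per_block - 1) // threads_per_block


for label, workers, tpb in [
    ("original gufunc", 1, 64),
    ("sliced on countRow", 100, 64),
    ("sliced on countCol", 4000, 64),
    ("vectorize", 2_000_000, 128),
]:
    blocks = blocks_for(workers, tpb)
    spare = blocks * tpb - workers
    print(f"{label}: {blocks} blocks x {tpb} threads, {spare} spare threads")
```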

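As a final observation, let's revisit the statement I made earlier (which is similar to statements made in the comments and in the other answer):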

"I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing."

We now have convenient test cases in either t16.py or t17.py that we could use to test this. For simplicity I'll choose t16.py. We can "convert this back to a CPU code" simply by removing the target designation from the guvectorize ufunc:

$ cat t16a.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python t16a.py
Function: discount factor cumVest duration (seconds):0.00657796859741

real    0m0.528s
user    0m0.474s
sys     0m0.047s
$

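So we see that this CPU-only version runs the function in about 6 milliseconds, and it has no GPU "overhead" such as CUDA initialization and copying of data to/from the GPU. The overall walltime is also our best measurement, at about 0.5 s versus about 1.0 s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, probably isn't well suited to GPU computation.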
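
The same conclusion falls out of the measured copy times alone. Using the durations from the t16.py profile (and ignoring the two 8-byte scalar copies), the transfers already take longer than the complete CPU run, so no kernel, however fast, could close the gap for this workload on this hardware.

```python
# Durations in milliseconds, read from the t16.py profile and the t16a.py run.
htod_ms = 0.027392 + 5.9293    # multBy and discount copied to the device
dtoh_ms = 7.3799               # cv copied back to the host
cpu_total_ms = 6.57796859741   # whole function on the CPU target

transfer_ms = htod_ms + dtoh_ms
print(f"transfers alone: {transfer_ms:.2f} ms")
print(f"CPU target, complete: {cpu_total_ms:.2f} ms")
print(f"GPU ahead even with a zero-cost kernel: {transfer_ms < cpu_total_ms}")
```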

python

performance

cuda

numba

nvprof
