Sum vector values with CUDA Python

I'm trying to sum the values of many vectors using CUDA Python. I found a solution that uses shared memory. Is there a way to do it without shared memory (since the amount of shared memory is small)? My vector sizes are:

N = 1000
i = 300000
v[i] = [1,2,..., N]

As a result I need to get:

out[i]= [sum(v[1]), sum(v[2]),..., sum(v[i])]
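
(In NumPy terms, assuming the vectors are stacked into a single 2D float32 array, what I want is roughly:)

import numpy as np

v = np.ones((300000, 1000), dtype=np.float32)  # 300000 vectors of length 1000 (placeholder data)
out = v.sum(axis=1)                            # one sum per vector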

Thanks for any suggestions :)

To do many reductions at once, and for the problem dimensions you indicate, it matters whether your vectors are stored in memory row-wise or column-wise.

For row-wise storage, a block-wise parallel reduction approach should be quite fast. Each block performs a standard sweep-based parallel reduction on a single vector and then writes the result to the output as a single number.

For column-wise storage, and for the problem dimensions you indicate (in particular the large number of vectors), it is efficient to have each thread perform the reduction for one vector, using a simple loop that walks down the column.

Here is a worked example of both approaches:

# cat t7.py
import numpy as np
import numba as nb
from numba import cuda,float32,int32

#vector length
N = 1000
#number of vectors
NV = 300000
#number of threads per block - must be a power of 2 less than or equal to 1024
threadsperblock = 256
#for vectors arranged row-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    bid = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
# load shared memory with vector using block-stride loop
    lid = tid
    sm[lid] = 0
    while lid < N:
        sm[tid] += vecs[bid, lid]
        lid += bdim
    cuda.syncthreads()
# perform shared memory sweep reduction
    sweep = bdim//2
    while sweep > 0:
        if tid < sweep:
            sm[tid] += sm[tid + sweep]
        sweep = sweep//2
        cuda.syncthreads()
    if tid == 0:
        sums[bid] = sm[0]

#for vectors arranged column-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_col(vecs, sums):
    idx = cuda.grid(1)
    if idx >= NV:
        return
    temp = 0
    for i in range(N):
        temp += vecs[i,idx]
    sums[idx] = temp

#perform row-test
rvecs  = np.ones((NV, N), dtype=np.float32)
sums   = np.zeros(NV, dtype=np.float32)
d_rvecs = cuda.to_device(rvecs)
d_sums = cuda.device_array_like(sums)
vec_sum_row[NV, threadsperblock](d_rvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])

#perform column-test
cvecs = np.ones((N, NV), dtype=np.float32)
d_cvecs = cuda.to_device(cvecs)
vec_sum_col[(NV+threadsperblock-1)//threadsperblock, threadsperblock](d_cvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])
# python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
# nvprof python t7.py
==5931== NVPROF is profiling process 5931, command: python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
==5931== Profiling application: python t7.py
==5931== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.20%  1.12464s         2  562.32ms  557.25ms  567.39ms  [CUDA memcpy HtoD]
                    0.59%  6.6881ms         1  6.6881ms  6.6881ms  6.6881ms  cudapy::__main__::vec_sum_row1(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
                    0.20%  2.2250ms         1  2.2250ms  2.2250ms  2.2250ms  cudapy::__main__::vec_sum_col2(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
                    0.02%  212.83us         2  106.42us  104.45us  108.38us  [CUDA memcpy DtoH]
      API calls:   73.60%  1.12571s         2  562.85ms  557.77ms  567.94ms  cuMemcpyHtoD
                   25.30%  386.91ms         1  386.91ms  386.91ms  386.91ms  cuDevicePrimaryCtxRetain
                    0.64%  9.8042ms         2  4.9021ms  2.6113ms  7.1929ms  cuMemcpyDtoH
                    0.23%  3.4945ms         3  1.1648ms  182.38us  1.6636ms  cuMemAlloc
                    0.07%  999.98us         2  499.99us  62.409us  937.57us  cuLinkCreate
                    0.04%  678.12us         2  339.06us  331.01us  347.12us  cuModuleLoadDataEx
                    0.03%  458.51us         1  458.51us  458.51us  458.51us  cuMemGetInfo
                    0.03%  431.28us         4  107.82us  98.862us  120.58us  cuDeviceGetName
                    0.03%  409.59us         2  204.79us  200.33us  209.26us  cuLinkAddData
                    0.03%  393.75us         2  196.87us  185.18us  208.56us  cuLinkComplete
                    0.01%  218.68us         2  109.34us  79.726us  138.96us  cuLaunchKernel
                    0.00%  14.052us         3  4.6840us     406ns  11.886us  cuDeviceGetCount
                    0.00%  13.391us        12  1.1150us     682ns  1.5910us  cuDeviceGetAttribute
                    0.00%  13.207us         8  1.6500us  1.0110us  3.1970us  cuDeviceGet
                    0.00%  6.6800us        10     668ns     366ns  1.6910us  cuFuncGetAttribute
                    0.00%  6.3560us         1  6.3560us  6.3560us  6.3560us  cuCtxPushCurrent
                    0.00%  4.1940us         2  2.0970us  1.9810us  2.2130us  cuModuleGetFunction
                    0.00%  4.0220us         4  1.0050us     740ns  1.7010us  cuDeviceComputeCapability
                    0.00%  2.5810us         2  1.2900us  1.1740us  1.4070us  cuLinkDestroy
#
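
As a quick sanity check (not part of the original t7.py run above), the device results can be compared against a host-side NumPy reduction; a minimal sketch, reusing the arrays from the script:

# after the column test, sums should match the per-column sums of cvecs
np.testing.assert_allclose(sums, cvecs.sum(axis=0))
# for the row test, the reference would instead be rvecs.sum(axis=1)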

If you have a choice of storage arrangement, column-wise storage is preferred for performance. In the example above, the row-sum kernel takes about 6.7 ms, whereas the column-sum kernel takes about 2.2 ms. The row-wise method above could be improved somewhat by launching a smaller number of blocks and having each block perform multiple reductions in a loop, but it is unlikely to be faster than the column-wise approach.
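
A minimal sketch of that variant (my own illustration alongside the code above; the kernel name is mine and it has not been benchmarked): launch a fixed, modest number of blocks and have each block stride over the rows, repeating the same shared-memory sweep for every row it visits.

#row-wise sums with a fixed grid; each block loops over rows with a grid-sized stride
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row_strided(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
    row = cuda.blockIdx.x
    while row < vecs.shape[0]:
        # load this row into shared memory with a block-stride loop
        sm[tid] = 0
        lid = tid
        while lid < N:
            sm[tid] += vecs[row, lid]
            lid += bdim
        cuda.syncthreads()
        # standard sweep reduction in shared memory
        sweep = bdim//2
        while sweep > 0:
            if tid < sweep:
                sm[tid] += sm[tid + sweep]
            sweep = sweep//2
            cuda.syncthreads()
        if tid == 0:
            sums[row] = sm[0]
        cuda.syncthreads()
        row += cuda.gridDim.x

# e.g. launch with: vec_sum_row_strided[1024, threadsperblock](d_rvecs, d_sums)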

Note that this code requires about 1.5GB of storage for each test (row and column), so it won't run on a GPU with a very small amount of memory (for example, a 2GB card). You could get it to run on a small-memory GPU by, for example, doing only the row test or only the column test, or by reducing the number of vectors.
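
For a rough sense of where that figure comes from, the dominant allocation in each test is the 300000 x 1000 float32 array (with a matching copy on the host):

# back-of-the-envelope size of the main array in each test
approx_bytes = 300000 * 1000 * 4  # 4 bytes per float32, about 1.2 GB per copy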