Sum vector values with CUDA Python

I'm trying to sum the values of many vectors using CUDA Python. I found a solution that uses shared memory. Is there a way to do it without shared memory (since the amount of shared memory is small)? My vector sizes are:

N = 1000
i = 300000
v[i] = [1,2,..., N]

As a result I need to get:

out[i]= [sum(v[1]), sum(v[2]),..., sum(v[i])]
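
(In NumPy terms, assuming the vectors are stacked into a single 2D float32 array, what I want is roughly:)

import numpy as np

v = np.ones((300000, 1000), dtype=np.float32)  # 300000 vectors of length 1000 (placeholder data)
out = v.sum(axis=1)                            # one sum per vector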

Thanks for any suggestions :)

To do many reductions at once, and for the problem dimensions you indicate, it matters whether your vectors are stored in memory row-wise or column-wise.

For row-wise storage, a block-wise parallel reduction approach should be quite fast. Each block performs a standard sweep-based parallel reduction on a single vector and then writes the result to the output as a single number.

For column-wise storage, and for the problem dimensions you indicate (in particular the large number of vectors), it is efficient to have each thread perform the reduction for one vector, using a simple loop that walks down the column.

Here is a worked example of both approaches:

# cat t7.py
import numpy as np
import numba as nb
from numba import cuda,float32,int32

#vector length
N = 1000
#number of vectors
NV = 300000
#number of threads per block - must be a power of 2 less than or equal to 1024
threadsperblock = 256
#for vectors arranged row-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    bid = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
# load shared memory with vector using block-stride loop
    lid = tid
    sm[lid] = 0
    while lid < N:
        sm[tid] += vecs[bid, lid]
        lid += bdim
    cuda.syncthreads()
# perform shared memory sweep reduction
    sweep = bdim//2
    while sweep > 0:
        if tid < sweep:
            sm[tid] += sm[tid + sweep]
        sweep = sweep//2
        cuda.syncthreads()
    if tid == 0:
        sums[bid] = sm[0]

#for vectors arranged column-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_col(vecs, sums):
    idx = cuda.grid(1)
    if idx >= NV:
        return
    temp = 0
    for i in range(N):
        temp += vecs[i,idx]
    sums[idx] = temp

#perform row-test
rvecs  = np.ones((NV, N), dtype=np.float32)
sums   = np.zeros(NV, dtype=np.float32)
d_rvecs = cuda.to_device(rvecs)
d_sums = cuda.device_array_like(sums)
vec_sum_row[NV, threadsperblock](d_rvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])

#perform column-test
cvecs = np.ones((N, NV), dtype=np.float32)
d_cvecs = cuda.to_device(cvecs)
vec_sum_col[(NV+threadsperblock-1)//threadsperblock, threadsperblock](d_cvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])
# python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
# nvprof python t7.py
==5931== NVPROF is profiling process 5931, command: python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
==5931== Profiling application: python t7.py
==5931== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.20%  1.12464s         2  562.32ms  557.25ms  567.39ms  [CUDA memcpy HtoD]
                    0.59%  6.6881ms         1  6.6881ms  6.6881ms  6.6881ms  cudapy::__main__::vec_sum_row1(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
                    0.20%  2.2250ms         1  2.2250ms  2.2250ms  2.2250ms  cudapy::__main__::vec_sum_col2(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
                    0.02%  212.83us         2  106.42us  104.45us  108.38us  [CUDA memcpy DtoH]
      API calls:   73.60%  1.12571s         2  562.85ms  557.77ms  567.94ms  cuMemcpyHtoD
                   25.30%  386.91ms         1  386.91ms  386.91ms  386.91ms  cuDevicePrimaryCtxRetain
                    0.64%  9.8042ms         2  4.9021ms  2.6113ms  7.1929ms  cuMemcpyDtoH
                    0.23%  3.4945ms         3  1.1648ms  182.38us  1.6636ms  cuMemAlloc
                    0.07%  999.98us         2  499.99us  62.409us  937.57us  cuLinkCreate
                    0.04%  678.12us         2  339.06us  331.01us  347.12us  cuModuleLoadDataEx
                    0.03%  458.51us         1  458.51us  458.51us  458.51us  cuMemGetInfo
                    0.03%  431.28us         4  107.82us  98.862us  120.58us  cuDeviceGetName
                    0.03%  409.59us         2  204.79us  200.33us  209.26us  cuLinkAddData
                    0.03%  393.75us         2  196.87us  185.18us  208.56us  cuLinkComplete
                    0.01%  218.68us         2  109.34us  79.726us  138.96us  cuLaunchKernel
                    0.00%  14.052us         3  4.6840us     406ns  11.886us  cuDeviceGetCount
                    0.00%  13.391us        12  1.1150us     682ns  1.5910us  cuDeviceGetAttribute
                    0.00%  13.207us         8  1.6500us  1.0110us  3.1970us  cuDeviceGet
                    0.00%  6.6800us        10     668ns     366ns  1.6910us  cuFuncGetAttribute
                    0.00%  6.3560us         1  6.3560us  6.3560us  6.3560us  cuCtxPushCurrent
                    0.00%  4.1940us         2  2.0970us  1.9810us  2.2130us  cuModuleGetFunction
                    0.00%  4.0220us         4  1.0050us     740ns  1.7010us  cuDeviceComputeCapability
                    0.00%  2.5810us         2  1.2900us  1.1740us  1.4070us  cuLinkDestroy
#
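
As a quick sanity check (not part of the original t7.py run above), the device results can be compared against a host-side NumPy reduction; a minimal sketch, reusing the arrays from the script:

# after the column test, sums should match the per-column sums of cvecs
np.testing.assert_allclose(sums, cvecs.sum(axis=0))
# for the row test, the reference would instead be rvecs.sum(axis=1)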

If you have a choice of storage arrangement, column-wise storage is preferred for performance. In the example above, the row-sum kernel takes about 6.7 ms, whereas the column-sum kernel takes about 2.2 ms. The row-wise method above could be improved somewhat by launching a smaller number of blocks and having each block perform multiple reductions in a loop, but it is unlikely to be faster than the column-wise approach.
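
A minimal sketch of that variant (my own illustration alongside the code above; the kernel name is mine and it has not been benchmarked): launch a fixed, modest number of blocks and have each block stride over the rows, repeating the same shared-memory sweep for every row it visits.

#row-wise sums with a fixed grid; each block loops over rows with a grid-sized stride
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row_strided(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
    row = cuda.blockIdx.x
    while row < vecs.shape[0]:
        # load this row into shared memory with a block-stride loop
        sm[tid] = 0
        lid = tid
        while lid < N:
            sm[tid] += vecs[row, lid]
            lid += bdim
        cuda.syncthreads()
        # standard sweep reduction in shared memory
        sweep = bdim//2
        while sweep > 0:
            if tid < sweep:
                sm[tid] += sm[tid + sweep]
            sweep = sweep//2
            cuda.syncthreads()
        if tid == 0:
            sums[row] = sm[0]
        cuda.syncthreads()
        row += cuda.gridDim.x

# e.g. launch with: vec_sum_row_strided[1024, threadsperblock](d_rvecs, d_sums)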

Note that this code requires about 1.5GB of storage for each test (row and column), so it won't run on a GPU with a very small amount of memory (for example, a 2GB card). You could get it to run on a small-memory GPU by, for example, doing only the row test or only the column test, or by reducing the number of vectors.
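
For a rough sense of where that figure comes from, the dominant allocation in each test is the 300000 x 1000 float32 array (with a matching copy on the host):

# back-of-the-envelope size of the main array in each test
approx_bytes = 300000 * 1000 * 4  # 4 bytes per float32, about 1.2 GB per copy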