Sum vector values with CUDA Python
I am trying to sum the values of many vectors using CUDA Python. I found a solution that uses shared memory. Is there a way to do this without shared memory (since the amount of shared memory is small)? My vector sizes are:
N = 1000
i = 300000
v[i] = [1,2,..., N]
As a result I need to get:
out[i]= [sum(v[1]), sum(v[2]),..., sum(v[i])]
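For reference, assuming the vectors are stored as rows of a single 2-D array (the array name v and the use of NumPy here are only illustrative, not my actual code), the desired result corresponds to a plain row-wise sum on the CPU:

import numpy as np

N = 1000                    # vector length
NV = 300000                 # number of vectors ("i" above)
# each row is [1, 2, ..., N]
v = np.tile(np.arange(1, N + 1, dtype=np.float32), (NV, 1))
out = v.sum(axis=1)         # one sum per vector
print(out[:4])              # [500500. 500500. 500500. 500500.]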
Thanks for any suggestions :)
To perform many reductions at once, and for the problem dimensions you indicate, it matters whether your vectors are stored in memory row-wise or column-wise.
For the row-wise storage case, a block-parallel reduction approach should be quite fast. Each block performs a standard sweep-based parallel reduction on a single vector and then writes the result to the output as a single number.
For the column-wise storage case, given the problem dimensions you indicate (in particular, a "large" number of vectors), it is efficient to have each thread reduce one vector using a straightforward loop that runs down a column.
Here is a worked example of both methods:
# cat t7.py
import numpy as np
import numba as nb
from numba import cuda,float32,int32
#vector length
N = 1000
#number of vectors
NV = 300000
#number of threads per block - must be a power of 2 less than or equal to 1024
threadsperblock = 256
#for vectors arranged row-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    bid = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
    # load shared memory with vector using block-stride loop
    lid = tid
    sm[lid] = 0
    while lid < N:
        sm[tid] += vecs[bid, lid]
        lid += bdim
    cuda.syncthreads()
    # perform shared memory sweep reduction
    sweep = bdim//2
    while sweep > 0:
        if tid < sweep:
            sm[tid] += sm[tid + sweep]
        sweep = sweep//2
        cuda.syncthreads()
    if tid == 0:
        sums[bid] = sm[0]
#for vectors arranged column-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_col(vecs, sums):
    idx = cuda.grid(1)
    if idx >= NV:
        return
    temp = 0
    for i in range(N):
        temp += vecs[i,idx]
    sums[idx] = temp
#perform row-test
rvecs = np.ones((NV, N), dtype=np.float32)
sums = np.zeros(NV, dtype=np.float32)
d_rvecs = cuda.to_device(rvecs)
d_sums = cuda.device_array_like(sums)
vec_sum_row[NV, threadsperblock](d_rvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])
#perform column-test
cvecs = np.ones((N, NV), dtype=np.float32)
d_cvecs = cuda.to_device(cvecs)
vec_sum_col[(NV+threadsperblock-1)//threadsperblock, threadsperblock](d_cvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])
# python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
# nvprof python t7.py
==5931== NVPROF is profiling process 5931, command: python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
==5931== Profiling application: python t7.py
==5931== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.20% 1.12464s 2 562.32ms 557.25ms 567.39ms [CUDA memcpy HtoD]
0.59% 6.6881ms 1 6.6881ms 6.6881ms 6.6881ms cudapy::__main__::vec_sum_row1(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
0.20% 2.2250ms 1 2.2250ms 2.2250ms 2.2250ms cudapy::__main__::vec_sum_col2(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
0.02% 212.83us 2 106.42us 104.45us 108.38us [CUDA memcpy DtoH]
API calls: 73.60% 1.12571s 2 562.85ms 557.77ms 567.94ms cuMemcpyHtoD
25.30% 386.91ms 1 386.91ms 386.91ms 386.91ms cuDevicePrimaryCtxRetain
0.64% 9.8042ms 2 4.9021ms 2.6113ms 7.1929ms cuMemcpyDtoH
0.23% 3.4945ms 3 1.1648ms 182.38us 1.6636ms cuMemAlloc
0.07% 999.98us 2 499.99us 62.409us 937.57us cuLinkCreate
0.04% 678.12us 2 339.06us 331.01us 347.12us cuModuleLoadDataEx
0.03% 458.51us 1 458.51us 458.51us 458.51us cuMemGetInfo
0.03% 431.28us 4 107.82us 98.862us 120.58us cuDeviceGetName
0.03% 409.59us 2 204.79us 200.33us 209.26us cuLinkAddData
0.03% 393.75us 2 196.87us 185.18us 208.56us cuLinkComplete
0.01% 218.68us 2 109.34us 79.726us 138.96us cuLaunchKernel
0.00% 14.052us 3 4.6840us 406ns 11.886us cuDeviceGetCount
0.00% 13.391us 12 1.1150us 682ns 1.5910us cuDeviceGetAttribute
0.00% 13.207us 8 1.6500us 1.0110us 3.1970us cuDeviceGet
0.00% 6.6800us 10 668ns 366ns 1.6910us cuFuncGetAttribute
0.00% 6.3560us 1 6.3560us 6.3560us 6.3560us cuCtxPushCurrent
0.00% 4.1940us 2 2.0970us 1.9810us 2.2130us cuModuleGetFunction
0.00% 4.0220us 4 1.0050us 740ns 1.7010us cuDeviceComputeCapability
0.00% 2.5810us 2 1.2900us 1.1740us 1.4070us cuLinkDestroy
#
If you have a choice of storage arrangement, column-wise storage is preferred for performance. In the example above, the row-sum kernel takes about 6.7 ms whereas the column-sum kernel takes about 2.2 ms. The row-wise method above could be improved somewhat by launching a smaller number of blocks and having each block perform multiple reductions in a loop, but it is unlikely to be faster than the column-wise method.
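As a rough sketch of that idea (not part of the original example: the kernel name vec_sum_row_multi and the grid size NUM_BLOCKS are assumptions, and N, NV, threadsperblock, d_rvecs and d_sums are reused from the code above), each block can loop over rows with a grid-stride, reusing its shared-memory buffer for one row at a time:

from numba import cuda, float32
# assumes N, NV and threadsperblock are defined as in t7.py above

NUM_BLOCKS = 1024   # assumed tuning parameter: far fewer blocks than NV

@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row_multi(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
    row = cuda.blockIdx.x
    # grid-stride loop over rows: each block reduces rows row, row+gridDim.x, ...
    while row < NV:
        # accumulate one row into shared memory
        sm[tid] = 0
        lid = tid
        while lid < N:
            sm[tid] += vecs[row, lid]
            lid += bdim
        cuda.syncthreads()
        # same sweep reduction as in vec_sum_row
        sweep = bdim//2
        while sweep > 0:
            if tid < sweep:
                sm[tid] += sm[tid + sweep]
            sweep = sweep//2
            cuda.syncthreads()
        if tid == 0:
            sums[row] = sm[0]
        cuda.syncthreads()   # make sure sm is free before the next row reuses it
        row += cuda.gridDim.x

# launched with a fixed, smaller grid instead of one block per vector, e.g.:
# vec_sum_row_multi[NUM_BLOCKS, threadsperblock](d_rvecs, d_sums)

Whether this helps depends on launch overhead and occupancy; as noted above, it is still unlikely to beat the column-wise kernel for this layout.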
Note that this code requires about 1.5 GB of storage for each test (row and column), so it will not run as-is on a GPU with a very small amount of memory (e.g. a 2 GB card). You could get it to run on a small-memory GPU by, for example, doing only the row test or only the column test, or by reducing the number of vectors.