How do I instruct CuPy to run many copies of the same job concurrently on a GPU?

Question

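Below is a simple function named job() that performs several CuPy tasks on the GPU.

How do I instruct CuPy to run job() a million times concurrently and then aggregate the results?

The purpose of my question is to understand how to submit multiple concurrent jobs to a single GPU via CuPy.

Test script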

import numpy as np
import cupy as cp

def job( nsamples ):
    # Do some CuPy tasks in GPU
    d_a = cp.random.randn( nsamples )
    d_b = cp.random.randint( -3, high=3, size=nsamples )
    d_result = ( d_a + d_b )
    d_hist, _ = cp.histogram( d_result, bins=cp.array([-3,-2,-1,0,1,2,3,4]) )
    std = cp.std( d_hist )
    return std

# Perform 1 job in GPU
nsamples = 10 #can be as large as tens to hundreds of thousands
std = job( nsamples )
print( 'std', std, type(std) )

Update:

# Create Cuda streams
d_streams = []
for i in range(0, 10):
    d_streams.append( cp.cuda.stream.Stream( non_blocking=True ) )

# Perform Concurrent jobs via Cuda Stream.
results = []
for stream in d_streams:
    with stream:
        results.append( job( nsamples ) )
print( 'results', results, len(results), type(std) )

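After reading this Nvidia developer blog on Cuda Stream, this CuPy issue on Support CUDA stream with stream memory pool, and this SOF question on CuPy Concurrency, I tried the approach above, and it seems to work. However, I don't know how to check whether the jobs are running concurrently or serially.

Questions:

How can I profile the jobs CuPy executes on the GPU to assess whether my script is doing what I want? Answer: nvprof --print-gpu-trace python filename.py
Is there a limit on the number of streams I can issue (e.g., imposed by some hardware constraint), or is it "unlimited"?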
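
As a coarse complement to the profiler trace, one could compare the wall-clock time of the same jobs issued serially and issued through streams. The sketch below is only an illustration under assumptions: the helper names wall_time, serial and streamed are made up for this example, the sizes are arbitrary, and it skips the comparison when no CUDA device is found. Similar timings do not prove that the jobs ran serially, since small kernels may simply be too short to overlap; the profiler trace remains the real evidence.

```python
import time

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    cp = None
    HAVE_GPU = False


def job(nsamples):
    # Same work as job() in the test script above
    d_a = cp.random.randn(nsamples)
    d_b = cp.random.randint(-3, high=3, size=nsamples)
    bins = cp.array([-3, -2, -1, 0, 1, 2, 3, 4])
    d_hist, _ = cp.histogram(d_a + d_b, bins=bins)
    return cp.std(d_hist)


def wall_time(fn):
    """Seconds taken by fn(), including the GPU work it queued (hypothetical helper)."""
    cp.cuda.Device().synchronize()  # drain work queued earlier
    t0 = time.perf_counter()
    fn()
    cp.cuda.Device().synchronize()  # wait for all streams on this device
    return time.perf_counter() - t0


def serial(nsamples, njobs):
    # Every job is issued on the current (default) stream
    return [job(nsamples) for _ in range(njobs)]


def streamed(nsamples, streams):
    # One job per stream, as in the update above
    results = []
    for stream in streams:
        with stream:
            results.append(job(nsamples))
    return results


if HAVE_GPU:
    streams = [cp.cuda.Stream(non_blocking=True) for _ in range(10)]
    serial(10_000, 10)  # warm-up so kernel compilation is not timed
    print("serial  :", wall_time(lambda: serial(10_000, 10)))
    print("streamed:", wall_time(lambda: streamed(10_000, streams)))
else:
    print("No CUDA device available; skipping the timing comparison.")
```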

Answer 1

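My general suggestion is to concatenate all the data together (across jobs) and look to do the work in a data-parallel fashion. Here is a rough example: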

$ cat t34.py
import numpy as np
import cupy as cp

def job( nsamples, njobs ):
    # Do some CuPy tasks in GPU
    d_a = cp.random.randn( nsamples, njobs )
    d_b = cp.random.randint( -3, high=3, size=(nsamples, njobs) )
    d_result = ( d_a + d_b )
    mybins = cp.array([-3,-2,-1,0,1,2,3,4])
    d_hist = cp.zeros((njobs,mybins.shape[0]-1))
    for i in range(njobs):
      # column i holds the nsamples values that belong to job i
      d_hist[i,:], _ = cp.histogram( d_result[:,i], bins=mybins )
    std = cp.std( d_hist, axis=1 )
    return std

nsamples = 10 #can be as large as tens to hundreds of thousands
std = job( nsamples, 2 )
print( 'std', std, type(std) )
$ python t34.py
std [0.69985421 0.45175395] <class 'cupy.core.core.ndarray'>
$

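For most of the operations in job, we can use the appropriate cupy operation to handle the work for all jobs at once. To pick one example, the std function can easily be extended to do its work across all jobs. histogram is the exception, because the routines in numpy or cupy do not allow for a partitioned/segmented algorithm, as far as I can see. So I used a loop for that. If this is the actual work you want to do, it may be possible to write a partitioned histogram cupy routine as a cupy kernel. Another alternative would be to issue just the cupy histogram calls in streams.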
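
One possible way to avoid the per-job loop without writing a custom kernel is to compute a bin index for every sample, shift each job's indices into its own block, and count everything with a single bincount. The sketch below is written against NumPy so it runs anywhere; the function name batched_histogram is made up for this example, and it assumes one row per job, so an array laid out as (nsamples, njobs) would be passed as its transpose. CuPy provides searchsorted, broadcast_to and bincount as well, so swapping the import to cupy may be enough, but treat that as an assumption to verify rather than a tested result.

```python
import numpy as np


def batched_histogram(values, edges):
    """Histogram each row of `values` (njobs, nsamples) against shared `edges`."""
    njobs = values.shape[0]
    nbins = edges.shape[0] - 1
    # Bin index of every sample; side="right" gives half-open bins [lo, hi)
    idx = np.searchsorted(edges, values, side="right") - 1
    # Like histogram, treat the last bin as closed on the right
    idx[values == edges[-1]] = nbins - 1
    # Samples outside [edges[0], edges[-1]] are dropped
    valid = (idx >= 0) & (idx < nbins)
    # Shift each job's bin indices into its own block of nbins slots
    rows = np.broadcast_to(np.arange(njobs)[:, None], values.shape)
    flat = (rows * nbins + idx)[valid]
    counts = np.bincount(flat, minlength=njobs * nbins)
    return counts.reshape(njobs, nbins)


rng = np.random.default_rng(0)
njobs, nsamples = 4, 1000
vals = rng.standard_normal((njobs, nsamples)) + rng.integers(-3, 3, size=(njobs, nsamples))
edges = np.array([-3, -2, -1, 0, 1, 2, 3, 4])

hist = batched_histogram(vals, edges)  # shape (njobs, 7)
std = hist.std(axis=1)                 # one value per job
print(hist.shape, std.shape)
```

The trade-off is memory: the index arrays are as large as the data itself, so for a very large number of jobs the work may need to be processed in chunks.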
