Julia:多个 GPU 上的并行 CUSPARSE 计算

Julia: Parallel CUSPARSE calculations on multiple GPUs

我有 n 个独立的 GPU,每个 GPU 都存储自己的数据。我想让他们每个人同时执行一组计算。 CUDArt 文档 here describes the use of streams to asynchronously call custom C kernels in order to achieve parallelization (see also this other example here)。对于自定义内核,这可以通过在 CUDArt 的 launch() 函数实现中使用 stream 参数来实现。然而,据我所知,CUSPARSE(或 CUBLAS)函数没有类似的流规范选项。

这对 CUSPARSE 是否可行,或者如果我想使用多个 GPU,我是否只需要深入了解 C?

修订赏金更新

好的,所以,我现在终于有了一个相对不错的解决方案。但是,我确信它可以通过一百万种方式进行改进——现在它还很老套。特别是,我喜欢按照我在 this SO 问题(我从未正确工作)中尝试和写的内容提出解决方案建议。因此,我很高兴将赏金奖励给在这里有更多想法的任何人。

好的,所以,我想我终于找到了至少相对有效的东西。我仍然非常乐意向任何有进一步改进的人提供赏金。特别是,根据 this SO 问题中描述的我尝试(但失败)实施的设计进行改进会很棒。但是,如果对此有任何改进或建议,我很乐意提供赏金。

我发现让 CUSPARSE 和 CUBLAS 之类的东西在多个 GPU 上并行化的方法的关键突破是你需要为每个 GPU 创建一个单独的句柄。例如。来自 CUBLAS API 上的 documentation:

The application must initialize the handle to the cuBLAS library context by calling the cublasCreate() function. Then, the is explicitly passed to every subsequent library function call. Once the application finishes using the library, it must call the function cublasDestory() to release the resources associated with the cuBLAS library context.

This approach allows the user to explicitly control the library setup when using multiple host threads and multiple GPUs. For example, the application can use cudaSetDevice() to associate different devices with different host threads and in each of those host threads it can initialize a unique handle to the cuBLAS library context, which will use the particular device associated with that host thread. Then, the cuBLAS library function calls made with different handle will automatically dispatch the computation to different devices.

(强调)

请参阅 here and here 以获取一些其他有用的文档。

现在,为了在这方面真正取得进展,我不得不做一堆相当混乱的黑客攻击。将来,我希望与开发 CUSPARSE 和 CUBLAS 软件包的人取得联系,了解如何将其合并到他们的软件包中。不过目前,这就是我所做的:

首先,CUSPARSE 和 CUBLAS 包带有创建句柄的函数。但是,我不得不对包进行一些修改以导出这些函数(连同所需的其他函数和对象类型),以便我自己可以实际访问它们。

具体来说,我向 CUSPARSE.jl 添加了以下内容:

export libcusparse, SparseChar

libcusparse_types.jl以下内容:

export cusparseHandle_t, cusparseOperation_t, cusparseMatDescr_t, cusparseStatus_t

libcusparse.jl以下内容:

export cusparseCreate

sparse.jl 以下内容:

export getDescr, cusparseop

通过所有这些,我能够获得对 cusparseCreate() 函数的功能访问,该函数可用于创建新句柄(我不能只使用 CUSPARSE.cusparseCreate() 因为该函数依赖于一堆其他函数和数据类型)。从那里,我定义了一个我想要的矩阵乘法运算的新版本,它带有一个额外的参数 Handle,以将 ccall() 提供给 CUDA 驱动程序。以下是完整代码:

using CUDArt, CUSPARSE  ## note: modified version of CUSPARSE, as indicated above.

N = 10^3;
M = 10^6;
p = 0.1;

devlist = devices(dev->true);
nGPU = length(devlist)

dev_X = Array(CudaSparseMatrixCSR, nGPU)
dev_b = Array(CudaArray, nGPU)
dev_c = Array(CudaArray, nGPU)
Handles = Array(Array{Ptr{Void},1}, nGPU)


for (idx, dev) in enumerate(devlist)
    println("sending data to device $dev")
    device(dev) ## switch to given device
    dev_X[idx] = CudaSparseMatrixCSR(sprand(N,M,p))
    dev_b[idx] = CudaArray(rand(M))
    dev_c[idx] = CudaArray(zeros(N))
    Handles[idx] = cusparseHandle_t[0]
    cusparseCreate(Handles[idx])
end


function Pmv!(
    Handle::Array{Ptr{Void},1},
    transa::SparseChar,
    alpha::Float64,
    A::CudaSparseMatrixCSR{Float64},
    X::CudaVector{Float64},
    beta::Float64,
    Y::CudaVector{Float64},
    index::SparseChar)
    Mat     = A
    cutransa = cusparseop(transa)
    m,n = Mat.dims
    cudesc = getDescr(A,index)
    device(device(A))  ## necessary to switch to the device associated with the handle and data for the ccall 
    ccall(
        ((:cusparseDcsrmv),libcusparse), 

        cusparseStatus_t,

        (cusparseHandle_t, cusparseOperation_t, Cint,
        Cint, Cint, Ptr{Float64}, Ptr{cusparseMatDescr_t},
        Ptr{Float64}, Ptr{Cint}, Ptr{Cint}, Ptr{Float64},
        Ptr{Float64}, Ptr{Float64}), 

        Handle[1],
        cutransa, m, n, Mat.nnz, [alpha], &cudesc, Mat.nzVal,
        Mat.rowPtr, Mat.colVal, X, [beta], Y
    )
end

function test(Handles, dev_X, dev_b, dev_c, idx)
    Pmv!(Handles[idx], 'N',  1.0, dev_X[idx], dev_b[idx], 0.0, dev_c[idx], 'O')
    device(idx-1)
    return to_host(dev_c[idx])
end


function test2(Handles, dev_X, dev_b, dev_c)

    @sync begin
        for (idx, dev) in enumerate(devlist)
            @async begin
                Pmv!(Handles[idx], 'N',  1.0, dev_X[idx], dev_b[idx], 0.0, dev_c[idx], 'O')
            end
        end
    end
    Results = Array(Array{Float64}, nGPU)
    for (idx, dev) in enumerate(devlist)
        device(dev)
        Results[idx] = to_host(dev_c[idx]) ## to_host doesn't require setting correct device first.  But, it is  quicker if you do this.
    end

    return Results
end

## Function times given after initial run for compilation
@time a = test(Handles, dev_X, dev_b, dev_c, 1); ## 0.010849 seconds (12 allocations: 8.297 KB)
@time b = test2(Handles, dev_X, dev_b, dev_c);   ## 0.011503 seconds (68 allocations: 19.641 KB)

# julia> a == b[1]
# true

一个小的改进是将 ccall 表达式包装在检查函数中,以便在调用 CUDA returns 错误时获得输出。