在多 GPU 上启动异步内存复制操作

Launching asynchronous memory copy opeerations on multiple-GPUs

我想将主机上的数据数组划分并复制到多个GPU的设备内存中。另外,我想同时进行所有这些复制操作。

为此,我使用了 cudaMemcpyAsync,我在每个 GPU 的私有流中启动它。

这是我在做的(代码中的疑问用 ?? 开头的注释标记)

#define SIZE 1000
#define GPUCOUNT 2

int* hostData = nullptr;
int *devData[GPUCOUNT];
cudaStream_t stream[GPUCOUNT];

// Create one stream per GPU
for ( int i=0; i != GPUCOUNT ; ++i )
{    
    // DO I need to call cudaSetDevice before creating stream for each GPU ??
    cudaStreamCreate(&stream[i]));
}

// Allocate pinned data on host
cudaMallocHost (&hostData, SIZE );

// Allocate data on each device and copy part of host data to it
for( int i=0; i != GPUCOUNT ; ++i )
{
   cudaSetDevice(i);
   cudaMalloc( (void**) &devData[i], sizeof(int) * SIZE/GPUCOUNT  ); // ?? Does blocking behavior of cudamalloc prevents asynch memcpy invoked in stream of other GPUs from running concurrently 
   cudaMemcpyAsync( (void*) devData[i], hostData + i*SIZE/GPUCOUNT, SIZE/GPUCOUNT, cudaMemcpyHostToDevice, stream[i] );
}

// Some CPU code while copy is happening
// ....

// Wait for copy on all streams to finish
cudaDeviceSynchronize();

// Do something else

当我阅读 C 编程指南时,我发现上述内存复制操作不会异步发生,因为在两次连续的异步内存复制启动之间,我正在调用分配设备内存的主机操作(阻塞调用)。

3.2.5.5.4. Implicit Synchronization

Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:

‣ a page-locked host memory allocation,

‣ a device memory allocation,

‣ a device memory set,

‣ a memory copy between two addresses to the same device memory,

‣ any CUDA command to the default stream,

如果上述原因似乎是真的,那么我需要拆分我的内存分配和复制操作

// Allocate data on each device 
for( int i=0; i != GPUCOUNT ; ++i )
{
   cudaSetDevice(i);
   cudaMalloc( (void**) &devData[i], sizeof(int) * SIZE/GPUCOUNT  );
}

// Copy part of host data to each device
for( int i=0; i != GPUCOUNT ; ++i )
{
   // ?? DO I need to call cudaSetDevice before memory copy ??
   // CUDA guide says:"A memory copy will succeed even if it is issued to a stream that is not associated to the current device."

   cudaMemcpyAsync( (void*) devData[i], hostData + i*SIZE/GPUCOUNT, SIZE/GPUCOUNT, cudaMemcpyHostToDevice, stream[i] ); 
}

我上面的分析是否有效?

此外,如果不通过在每个 GPU 的默认流(流 ID 0)中启动 cudaMemcpyAsync 操作来创建显式的每个 gpu 流,是否无法做到这一点?。我基于 CUDA C 编程指南中的以下声明:

Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.

代码看起来像这样

#define SIZE 1000
#define GPUCOUNT 2

int* hostData = nullptr;
int *devData[GPUCOUNT];

// Allocate pinned data on host
cudaMallocHost (&hostData, SIZE );

// Allocate data on each device
for( int i=0; i != GPUCOUNT ; ++i )
{
   cudaSetDevice(i);
   cudaMalloc( (void**) &devData[i], sizeof(int) * SIZE/GPUCOUNT  );
}

// Copy part of host data to each device
for( int i=0; i != GPUCOUNT ; ++i )
{
   // ?? DO I need to call cudaSetDevice before memory copy ??
   // CUDA guide says:"A memory copy will succeed even if it is issued to a stream that is not associated to the current device."

   cudaMemcpyAsync( (void*) devData[i], hostData + i*SIZE/GPUCOUNT, SIZE/GPUCOUNT, cudaMemcpyHostToDevice, 0 ); 
}

// Some CPU code while copy is happening
// ....

// Wait for copy on all streams to finish
cudaDeviceSynchronize();

// Do something else

http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__DEVICE_g418c299b069c4803bfb7cab4943da383.html

cudaError_t cudaSetDevice   (   int     device   )      

Sets device as the current device for the calling host thread.

Any device memory subsequently allocated from this host thread using cudaMalloc(), cudaMallocPitch() or cudaMallocArray() will be physically resident on device. Any host memory allocated from this host thread using cudaMallocHost() or cudaHostAlloc() or cudaHostRegister() will have its lifetime associated with device. Any streams or events created from this host thread will be associated with device. Any kernels launched from this host thread using the <<<>>> operator or cudaLaunch() will be executed on device.

This call may be made from any host thread, to any device, and at any time. This function will do no synchronization with the previous or new device, and should be considered a very low overhead call.

看起来 set device 可以在没有流的情况下完成您需要的一切。您应该能够浏览每个设备,使用它们的默认流并调用 malloc 和 memcpy。使用异步 memcpy 和基于流的内核调用将有助于设备上的并发内存传输和内核调用。

您确实需要在对该设备的所有调用之前调用 setdevice。 Streams 不会对此提供帮助。