Combining `mmap` and UVM features
Is there anything that provides both of these features at once? I am looking for a function that allocates memory with two characteristics: it is memory-mapped (as with an `mmap` allocation) and it is UVM-enabled (accessible from both the host and GPU devices). I see that `cudaHostAlloc` allocates host memory that is accessible to the device, but there is no apparent way to declare the allocated memory range as memory-mapped!

So my question is: is there an API function that allocates memory with both of the above characteristics?

If the answer is "no", is there a set of API functions I can call that results in the same behavior? For example: first allocate UVM-based memory with `cudaMallocManaged`, then use a specific API (POSIX or CUDA) to declare the previously allocated memory as memory-mapped (just like `mmap`)? Or vice versa: allocate with `mmap`, then declare the range to the CUDA driver as UVM-accessible?

Any other suggestions would also be appreciated!
Update, Dec 13, 2018:

Unfortunately, the suggestion provided by @tera does not seem to work as expected: when the code executes on the device, the device appears unable to see the memory on the host!

Below is the code I am using, followed by the compilation command.
```c++
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <assert.h>

__global__
void touchKernel(char *d, char init, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n)
        return;
    d[index] = init;
}

void process_file(const char *filename, int n) {
    if (n < 0) {
        printf("Error in n: %d\n", n);
        exit(1);
    }
    size_t filesize = n * sizeof(char);

    // Open (and create if necessary) the backing file
    int fd = open(filename, O_RDWR | O_CREAT, 0666);
    if (fd == -1) {
        perror("Open API");
        exit(1);
    }
    // Grow the file to the requested size
    if (ftruncate(fd, filesize) == -1) {
        perror("ftruncate");
        exit(1);
    }

    // Map the file into the address space, locked in RAM
    char *mmappedData = (char *) mmap(0, filesize, PROT_READ | PROT_WRITE,
                                      MAP_SHARED | MAP_LOCKED, fd, 0);
    assert(mmappedData != MAP_FAILED);
    printf("mmappedData: %p\n", mmappedData);
    for (int i = 0; i < n; i++)
        mmappedData[i] = 'z';

    // Pin the mapped range with CUDA so the device can access it
    if (cudaSuccess != cudaHostRegister(mmappedData, filesize, cudaHostRegisterDefault)) {
        printf("Unable to register with CUDA!\n");
        exit(1);
    }

    // Enough blocks to cover n elements (the kernel guards against overshoot)
    int vec = 256;
    int gang = n / vec + 1;
    printf("gang: %d - vec: %d\n", gang, vec);
    touchKernel<<<gang, vec>>>((char *) mmappedData, 'a', n);
    cudaDeviceSynchronize();

    // Cleanup: unregister before unmapping
    cudaHostUnregister(mmappedData);
    int rc = munmap(mmappedData, filesize);
    assert(rc == 0);
    close(fd);
}

int main(int argc, char const *argv[])
{
    process_file("buffer.obj", 10);
    return 0;
}
```
Then compile and run it like this:

```
nvcc -g -O0 f1.cu && cuda-memcheck ./a.out
```

`cuda-memcheck` generates output reporting that the threads cannot reach the memory addresses, similar to the following:
```
========= Invalid __global__ write of size 1
=========     at 0x000000b0 in touchKernel(char*, char, int)
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x7fdc8e137002 is out of bounds
=========     Device Frame:touchKernel(char*, char, int) (touchKernel(char*, char, int) : 0xb0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
=========     Host Frame:./a.out [0x22b22]
=========     Host Frame:./a.out [0x22d17]
=========     Host Frame:./a.out [0x570d5]
=========     Host Frame:./a.out [0x6db8]
=========     Host Frame:./a.out [0x6c76]
=========     Host Frame:./a.out [0x6cc3]
=========     Host Frame:./a.out [0x6a4c]
=========     Host Frame:./a.out [0x6ade]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./a.out [0x673a]
=========
========= Invalid __global__ write of size 1
=========     at 0x000000b0 in touchKernel(char*, char, int)
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0x7fdc8e137001 is out of bounds
=========     Device Frame:touchKernel(char*, char, int) (touchKernel(char*, char, int) : 0xb0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
=========     Host Frame:./a.out [0x22b22]
=========     Host Frame:./a.out [0x22d17]
=========     Host Frame:./a.out [0x570d5]
=========     Host Frame:./a.out [0x6db8]
=========     Host Frame:./a.out [0x6c76]
=========     Host Frame:./a.out [0x6cc3]
=========     Host Frame:./a.out [0x6a4c]
=========     Host Frame:./a.out [0x6ade]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./a.out [0x673a]
=========
========= Invalid __global__ write of size 1
=========     at 0x000000b0 in touchKernel(char*, char, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7fdc8e137000 is out of bounds
=========     Device Frame:touchKernel(char*, char, int) (touchKernel(char*, char, int) : 0xb0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
=========     Host Frame:./a.out [0x22b22]
=========     Host Frame:./a.out [0x22d17]
=========     Host Frame:./a.out [0x570d5]
=========     Host Frame:./a.out [0x6db8]
=========     Host Frame:./a.out [0x6c76]
=========     Host Frame:./a.out [0x6cc3]
=========     Host Frame:./a.out [0x6a4c]
=========     Host Frame:./a.out [0x6ade]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./a.out [0x673a]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x351c13]
=========     Host Frame:./a.out [0x40a16]
=========     Host Frame:./a.out [0x6a51]
=========     Host Frame:./a.out [0x6ade]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./a.out [0x673a]
=========
```
The output above indicates that the code did not execute successfully on the device.

Any suggestions?
Update, Dec 14, 2018:

I changed the kernel code to the following:
```c++
__global__
void touchKernel(char *d, char init, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n || index < 0)
        return;
    printf("index %d\n", index);
    d[index] = init + (index % 20);
    printf("index %d - Done\n", index);
}
```
With this kernel in place of the old one, the output of both `printf` statements is visible, and inspecting the `buffer.obj` file shows that it contains the correct output!
2018 年 12 月 14 日更新
可能 cuda-memcheck
有一些问题。原来如果执行文件没有cuda-memcheck
,那么buffer.obj
的内容完全正确。但是,如果可执行文件with cuda-memcheck
执行,那么输出文件(buffer.obj
)的内容就完全不正确!
Coincidentally, I just answered a similar question on Nvidia's forums.

You can `cudaHostRegister()` mmapped memory if you pass the `MAP_LOCKED` flag to `mmap()`.

Doing so, you may need to increase the limit on locked memory (`ulimit -l` in bash).
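On that last point: besides raising the limit in the shell, a process can try to raise its own locked-memory soft limit up to the hard limit with `setrlimit(RLIMIT_MEMLOCK, ...)`. The following is a minimal sketch of that idea; the helper name is mine, not from the answer:

```c++
#include <stdio.h>
#include <sys/resource.h>

// Attempt to raise the locked-memory soft limit to the hard limit.
// An unprivileged process cannot exceed its hard limit, so a failure
// here means the limit must be raised externally (ulimit / limits.conf).
static int raise_memlock_limit(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    rl.rlim_cur = rl.rlim_max;  // raise soft limit up to the hard limit
    if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("setrlimit");
        return -1;
    }
    printf("RLIMIT_MEMLOCK soft limit now %llu bytes\n",
           (unsigned long long) rl.rlim_cur);
    return 0;
}
```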
Update:

It turns out that the `MAP_LOCKED` flag to `mmap()` is not even necessary. However, the documentation for `cudaHostRegister()` lists some further restrictions; both points are illustrated in the sketch after this list:

- On systems without unified virtual addressing, the `cudaHostRegisterMapped` flag needs to be passed to `cudaHostRegister()`, or the memory will not be mapped. Unless the device has a non-zero value for the `cudaDevAttrCanUseHostPointerForRegisteredMem` attribute, this also means you need to query the device address for the mapped memory range via `cudaHostGetDevicePointer()`.
- The CUDA context must have been created with the `cudaMapHost` flag (`CU_CTX_MAP_HOST` in the driver API) for the mapping to happen. Since contexts are created lazily by the runtime API, you have to create the context yourself through the driver API before any runtime API call, in order to influence the flags the context is created with.
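To make these two restrictions concrete, here is a minimal sketch of how the pieces could fit together. This is my own hedged illustration, not code from the answer: it assumes device 0 and an already-mmapped range described by `ptr` and `size`, creates the context explicitly through the driver API with `CU_CTX_MAP_HOST`, registers the range with `cudaHostRegisterMapped`, and queries a device pointer unless the device reports `cudaDevAttrCanUseHostPointerForRegisteredMem`:

```c++
#include <stdio.h>
#include <cuda.h>          // driver API
#include <cuda_runtime.h>  // runtime API

// Sketch: returns a pointer usable in kernels for an mmapped host range,
// or NULL on failure. Assumes no CUDA context exists yet.
char *get_device_view(char *ptr, size_t size) {
    // 1. Create the context ourselves, before any runtime API call,
    //    so we control its flags (here: CU_CTX_MAP_HOST).
    CUdevice dev;
    CUcontext ctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, CU_CTX_MAP_HOST, dev);

    // 2. Register the mmapped range; cudaHostRegisterMapped is required
    //    on systems without unified virtual addressing.
    if (cudaHostRegister(ptr, size, cudaHostRegisterMapped) != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister failed\n");
        return NULL;
    }

    // 3. If the device can use the host pointer directly, we are done.
    int canUseHostPtr = 0;
    cudaDeviceGetAttribute(&canUseHostPtr,
                           cudaDevAttrCanUseHostPointerForRegisteredMem, 0);
    if (canUseHostPtr)
        return ptr;

    // 4. Otherwise, query the device-side address of the mapped range.
    void *dptr = NULL;
    if (cudaHostGetDevicePointer(&dptr, ptr, 0) != cudaSuccess) {
        fprintf(stderr, "cudaHostGetDevicePointer failed\n");
        return NULL;
    }
    return (char *) dptr;
}
```

A program mixing the driver and runtime APIs like this is still compiled with `nvcc`, but additionally needs to link against the driver library (`-lcuda`).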