Combining `mmap` and UVM features
Is there anything that provides both of these features at once? I am looking for a function that allocates memory with two characteristics: it is memory-mapped (as with an `mmap` allocation) and it is UVM-enabled (accessible from both the host and GPU devices). I see that `cudaHostAlloc` allocates host memory that is accessible to the device, but there is no apparent way to declare the allocated memory range as memory-mapped!

So my question is: is there an API function that allocates memory with both of the above characteristics?

If the answer is "no", is there a set of API functions I can call that results in the same behavior? For example: first allocate UVM-based memory with `cudaMallocManaged`, then use a specific API (POSIX or CUDA) to declare the previously allocated memory as memory-mapped (just like `mmap`)? Or vice versa: allocate with `mmap`, then declare the range to the CUDA driver as UVM-accessible?

Any other suggestions would also be appreciated!
Update, Dec 13, 2018:

Unfortunately, the suggestion provided by @tera does not seem to work as expected: when the code executes on the device, the device appears unable to see the memory on the host!

Below is the code I am using, followed by the compilation command.
```c++
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <assert.h>

__global__
void touchKernel(char *d, char init, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n)
        return;
    d[index] = init;
}

void process_file(const char *filename, int n) {
    if (n < 0) {
        printf("Error in n: %d\n", n);
        exit(1);
    }
    size_t filesize = n * sizeof(char);

    // Open (and create if necessary) the backing file
    int fd = open(filename, O_RDWR | O_CREAT, 0666);
    if (fd == -1) {
        perror("Open API");
        exit(1);
    }
    // Grow the file to the requested size
    if (ftruncate(fd, filesize) == -1) {
        perror("ftruncate");
        exit(1);
    }

    // Map the file into the address space, locked in RAM
    char *mmappedData = (char *) mmap(0, filesize, PROT_READ | PROT_WRITE,
                                      MAP_SHARED | MAP_LOCKED, fd, 0);
    assert(mmappedData != MAP_FAILED);
    printf("mmappedData: %p\n", mmappedData);
    for (int i = 0; i < n; i++)
        mmappedData[i] = 'z';

    // Pin the mapped range with CUDA so the device can access it
    if (cudaSuccess != cudaHostRegister(mmappedData, filesize, cudaHostRegisterDefault)) {
        printf("Unable to register with CUDA!\n");
        exit(1);
    }

    // Enough blocks to cover n elements (the kernel guards against overshoot)
    int vec = 256;
    int gang = n / vec + 1;
    printf("gang: %d - vec: %d\n", gang, vec);
    touchKernel<<<gang, vec>>>((char *) mmappedData, 'a', n);
    cudaDeviceSynchronize();

    // Cleanup: unregister before unmapping
    cudaHostUnregister(mmappedData);
    int rc = munmap(mmappedData, filesize);
    assert(rc == 0);
    close(fd);
}

int main(int argc, char const *argv[])
{
    process_file("buffer.obj", 10);
    return 0;
}
```
Then compile and run it like this:

```
nvcc -g -O0 f1.cu && cuda-memcheck ./a.out
```

`cuda-memcheck` generates output reporting that the threads cannot reach the memory addresses, similar to the following:
```
========= Invalid __global__ write of size 1
=========     at 0x000000b0 in touchKernel(char*, char, int)
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x7fdc8e137002 is out of bounds
=========     Device Frame:touchKernel(char*, char, int) (touchKernel(char*, char, int) : 0xb0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
=========     Host Frame:./a.out [0x22b22]
=========     Host Frame:./a.out [0x22d17]
=========     Host Frame:./a.out [0x570d5]
=========     Host Frame:./a.out [0x6db8]
=========     Host Frame:./a.out [0x6c76]
=========     Host Frame:./a.out [0x6cc3]
=========     Host Frame:./a.out [0x6a4c]
=========     Host Frame:./a.out [0x6ade]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./a.out [0x673a]
=========
========= Invalid __global__ write of size 1
=========     at 0x000000b0 in touchKernel(char*, char, int)
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0x7fdc8e137001 is out of bounds
=========     Device Frame:touchKernel(char*, char, int) (touchKernel(char*, char, int) : 0xb0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
=========     Host Frame:./a.out [0x22b22]
=========     Host Frame:./a.out [0x22d17]
=========     Host Frame:./a.out [0x570d5]
=========     Host Frame:./a.out [0x6db8]
=========     Host Frame:./a.out [0x6c76]
=========     Host Frame:./a.out [0x6cc3]
=========     Host Frame:./a.out [0x6a4c]
=========     Host Frame:./a.out [0x6ade]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./a.out [0x673a]
=========
========= Invalid __global__ write of size 1
=========     at 0x000000b0 in touchKernel(char*, char, int)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7fdc8e137000 is out of bounds
=========     Device Frame:touchKernel(char*, char, int) (touchKernel(char*, char, int) : 0xb0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24d9dd]
=========     Host Frame:./a.out [0x22b22]
=========     Host Frame:./a.out [0x22d17]
=========     Host Frame:./a.out [0x570d5]
=========     Host Frame:./a.out [0x6db8]
=========     Host Frame:./a.out [0x6c76]
=========     Host Frame:./a.out [0x6cc3]
=========     Host Frame:./a.out [0x6a4c]
=========     Host Frame:./a.out [0x6ade]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./a.out [0x673a]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x351c13]
=========     Host Frame:./a.out [0x40a16]
=========     Host Frame:./a.out [0x6a51]
=========     Host Frame:./a.out [0x6ade]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./a.out [0x673a]
=========
```
The output above indicates that the code did not execute successfully on the device.

Any suggestions?
Update, Dec 14, 2018:

I changed the kernel code to the following:
```c++
__global__
void touchKernel(char *d, char init, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n || index < 0)
        return;
    printf("index %d\n", index);
    d[index] = init + (index % 20);
    printf("index %d - Done\n", index);
}
```
With this kernel in place of the old one, the output of both `printf` statements is visible, and inspecting the `buffer.obj` file shows that it contains the correct output!
2018 年 12 月 14 日更新
可能 cuda-memcheck
有一些问题。原来如果执行文件没有cuda-memcheck
,那么buffer.obj
的内容完全正确。但是,如果可执行文件with cuda-memcheck
执行,那么输出文件(buffer.obj
)的内容就完全不正确!
Coincidentally, I just answered a similar question on Nvidia's forums.

You can `cudaHostRegister()` mmapped memory if you pass the `MAP_LOCKED` flag to `mmap()`.

Doing so, you may need to increase the limit on locked memory (`ulimit -l` in bash).
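On that last point: besides raising the limit in the shell, a process can try to raise its own locked-memory soft limit up to the hard limit with `setrlimit(RLIMIT_MEMLOCK, ...)`. The following is a minimal sketch of that idea; the helper name is mine, not from the answer:

```c++
#include <stdio.h>
#include <sys/resource.h>

// Attempt to raise the locked-memory soft limit to the hard limit.
// An unprivileged process cannot exceed its hard limit, so a failure
// here means the limit must be raised externally (ulimit / limits.conf).
static int raise_memlock_limit(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    rl.rlim_cur = rl.rlim_max;  // raise soft limit up to the hard limit
    if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("setrlimit");
        return -1;
    }
    printf("RLIMIT_MEMLOCK soft limit now %llu bytes\n",
           (unsigned long long) rl.rlim_cur);
    return 0;
}
```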
Update:

It turns out that the `MAP_LOCKED` flag to `mmap()` is not even necessary. However, the documentation for `cudaHostRegister()` lists some further restrictions; both points are illustrated in the sketch after this list:

- On systems without unified virtual addressing, the `cudaHostRegisterMapped` flag needs to be passed to `cudaHostRegister()`, or the memory will not be mapped. Unless the device has a non-zero value for the `cudaDevAttrCanUseHostPointerForRegisteredMem` attribute, this also means you need to query the device address for the mapped memory range via `cudaHostGetDevicePointer()`.
- The CUDA context must have been created with the `cudaMapHost` flag (`CU_CTX_MAP_HOST` in the driver API) for the mapping to happen. Since contexts are created lazily by the runtime API, you have to create the context yourself through the driver API before any runtime API call, in order to influence the flags the context is created with.
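To make these two restrictions concrete, here is a minimal sketch of how the pieces could fit together. This is my own hedged illustration, not code from the answer: it assumes device 0 and an already-mmapped range described by `ptr` and `size`, creates the context explicitly through the driver API with `CU_CTX_MAP_HOST`, registers the range with `cudaHostRegisterMapped`, and queries a device pointer unless the device reports `cudaDevAttrCanUseHostPointerForRegisteredMem`:

```c++
#include <stdio.h>
#include <cuda.h>          // driver API
#include <cuda_runtime.h>  // runtime API

// Sketch: returns a pointer usable in kernels for an mmapped host range,
// or NULL on failure. Assumes no CUDA context exists yet.
char *get_device_view(char *ptr, size_t size) {
    // 1. Create the context ourselves, before any runtime API call,
    //    so we control its flags (here: CU_CTX_MAP_HOST).
    CUdevice dev;
    CUcontext ctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, CU_CTX_MAP_HOST, dev);

    // 2. Register the mmapped range; cudaHostRegisterMapped is required
    //    on systems without unified virtual addressing.
    if (cudaHostRegister(ptr, size, cudaHostRegisterMapped) != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister failed\n");
        return NULL;
    }

    // 3. If the device can use the host pointer directly, we are done.
    int canUseHostPtr = 0;
    cudaDeviceGetAttribute(&canUseHostPtr,
                           cudaDevAttrCanUseHostPointerForRegisteredMem, 0);
    if (canUseHostPtr)
        return ptr;

    // 4. Otherwise, query the device-side address of the mapped range.
    void *dptr = NULL;
    if (cudaHostGetDevicePointer(&dptr, ptr, 0) != cudaSuccess) {
        fprintf(stderr, "cudaHostGetDevicePointer failed\n");
        return NULL;
    }
    return (char *) dptr;
}
```

A program mixing the driver and runtime APIs like this is still compiled with `nvcc`, but additionally needs to link against the driver library (`-lcuda`).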