CUDA unified memory pages accessed in CPU but not evicted from GPU

I am trying to understand how CUDA unified memory works. I have read the blog on CUDA unified memory for beginners, and I wrote the code given below:

#include <iostream>

// Each of the 512 threads adds 3 to its own element of the managed array.
__global__ void transfer(int *X)
{
    X[threadIdx.x] = X[threadIdx.x] + 3;
}

int main()
{
    int *x;
    size_t free_bytes, total_bytes;

    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "Before cudaMallocManaged: " << "free: " << free_bytes << " total: " << total_bytes << '\n';
    cudaMallocManaged(&x, sizeof(int) * 512);

    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "After cudaMallocManaged and Before Prefetch to GPU: " << "free: " << free_bytes << " total: " << total_bytes << '\n';
    // Prefetch the range to device 0; the returned cudaError_t is printed.
    std::cout << cudaMemPrefetchAsync(x, sizeof(int) * 512, 0);
    cudaMemset(x, 0, sizeof(int) * 512);
    cudaDeviceSynchronize();
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "\nAfter Prefetch to GPU Before Kernel call: " << "free: " << free_bytes << " total: " << total_bytes << '\n';
    transfer<<<1, 512>>>(x);
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "After Kernel call Before memAdvise: " << "free: " << free_bytes << " total: " << total_bytes << '\n';
    // Hint that the preferred location for this range is CPU memory.
    cudaMemAdvise(x, sizeof(int) * 512, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "After memAdvise Before Prefetch to CPU: " << "free: " << free_bytes << " total: " << total_bytes << '\n';
    // Prefetch the range back to host memory; the returned cudaError_t is printed.
    std::cout << cudaMemPrefetchAsync(x, sizeof(int) * 512, cudaCpuDeviceId);
    cudaDeviceSynchronize();
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "\nAfter Prefetch Before processing in CPU: " << "free: " << free_bytes << " total: " << total_bytes << '\n';
    for (int i = 0; i < 512; i++)
    {
        x[i] = x[i] + 1;
        std::cout << x[i];
    }
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "\nAfter processing in CPU Before free: " << "free: " << free_bytes << " total: " << total_bytes << '\n';
    cudaFree(x);
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "After free: " << "free: " << free_bytes << " total: " << total_bytes << '\n';
    return 0;
}

Output:

Before cudaMallocManaged: free: 16804216832 total: 17071734784
After cudaMallocManaged and Before Prefetch to GPU: free: 16804216832 total: 17071734784
0
After Prefetch to GPU Before Kernel call: free: 16669999104 total: 17071734784
After Kernel call Before memAdvise: free: 16669999104 total: 17071734784
After memAdvise Before Prefetch to CPU: free: 16669999104 total: 17071734784
0
After Prefetch Before processing in CPU: free: 16669999104 total: 17071734784
44444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
After processing in CPU Before free: free: 16669999104 total: 17071734784
After free: free: 16674193408 total: 17071734784

I am running the code on Kaggle, which provides a 16 GB Tesla P100 PCIe GPU. I have an integer array x allocated with cudaMallocManaged(). First I prefetch the array to the GPU and do some processing on it, then I prefetch it to the CPU and do some processing there. In between, I print the free memory available on the GPU before and after each memory transfer. Based on this I have two questions:

  1. During the first prefetch, right after cudaMallocManaged(), the free memory decreases by much more than I am allocating. Why?

  2. The free memory before and after prefetching to the CPU is the same. Also, when I access and modify the array on the CPU, the free memory on the GPU still remains the same before and after. I don't understand why this is happening. Shouldn't prefetching/processing a unified memory location on the CPU evict the corresponding pages from the GPU and move them to the CPU, and shouldn't this free up GPU memory?

There are two things to keep in mind here:

  1. Having a fully functional CUDA environment on a GPU carries considerable overhead. This can easily exceed 100 MB of space needed for CUDA itself, not counting your data.
  2. CUDA has a lazy initialization system.

During the first prefetch just after cudaMallocManaged() the free memory decreases a lot more than I am allocating. Why?

Because CUDA has a lazy initialization system. This means that as you continue to make CUDA runtime API calls in your program, it may progressively build up more and more of the environment needed to run your kernel code, along with the memory overhead associated with that. By the time of a kernel launch, most or all of that initialization will have completed, apart from things tied to new resource usage. So the decrease in free memory is due to your allocation plus the extra overhead of CUDA itself.
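One way to separate the two effects (a sketch, not part of the original program; the exact overhead varies by GPU, driver, and CUDA version) is to force context initialization with a trivial runtime call such as cudaFree(0) before taking the baseline cudaMemGetInfo reading. Later readings then mostly reflect the data itself:

#include <iostream>

int main()
{
    size_t free_bytes, total_bytes;

    // cudaFree(0) is a common idiom to force lazy context initialization now,
    // so its overhead is already included in the baseline reading below.
    cudaFree(0);
    cudaMemGetInfo(&free_bytes, &total_bytes);
    size_t baseline = free_bytes;

    int *x;
    cudaMallocManaged(&x, sizeof(int) * 512);
    cudaMemPrefetchAsync(x, sizeof(int) * 512, 0);
    cudaDeviceSynchronize();

    cudaMemGetInfo(&free_bytes, &total_bytes);
    // Expected to be much closer to the resident data size (rounded up to the
    // driver's page/allocation granularity) than the ~100 MB seen otherwise.
    std::cout << "Drop after prefetch: " << (baseline - free_bytes) << " bytes\n";

    cudaFree(x);
    return 0;
}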

The free memory before and after prefetching to the CPU is the same. Also, when I access and modify the array on the CPU the free memory on GPU before and after this still remains the same. I don't understand why this is happening.

The amount of memory we are talking about is on the order of 100 MB. Your allocation of 512*sizeof(int) is insignificant by comparison. Furthermore, the CUDA documentation makes no statement about what happens to the underlying allocation as a result of demand paging. You seem to assume that demand paging automatically frees the underlying allocation when the contents are paged out. That is not stated anywhere, and it is not the case; the exact behavior here is unspecified. In addition, the GPU in your setup is capable of memory oversubscription, so there is no particular reason to free the allocation immediately.
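If the question is where the pages currently are, a more direct probe than cudaMemGetInfo (a sketch; these range attributes require a Pascal-class or newer GPU and a CUDA version that supports them) is cudaMemRangeGetAttribute, which can report, for example, the last prefetch location of a managed range:

#include <iostream>

int main()
{
    int *x;
    cudaMallocManaged(&x, sizeof(int) * 512);

    // Move the pages to the host, as in the question.
    cudaMemPrefetchAsync(x, sizeof(int) * 512, cudaCpuDeviceId);
    cudaDeviceSynchronize();

    // Reports cudaCpuDeviceId (-1) for the host, or a device ordinal such as 0.
    int location = 0;
    cudaMemRangeGetAttribute(&location, sizeof(location),
                             cudaMemRangeAttributeLastPrefetchLocation,
                             x, sizeof(int) * 512);
    std::cout << "Last prefetch location: " << location << '\n';

    cudaFree(x);
    return 0;
}

Note that this tells you about page movement, not about whether the driver has returned physical GPU pages to the free pool; as stated above, that behavior is unspecified.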

When prefetching/processing a unified memory location on CPU shouldn't the corresponding pages on GPU be evicted and moved to CPU and shouldn't this free up the GPU memory?

Prefetching is not the same as eviction. But yes, prefetching to the CPU means that the corresponding pages are no longer resident in that GPU's memory. And no, there is no reason to assume that this automatically/immediately frees the GPU memory. You can expect the memory to be freed when you call cudaFree on the allocated pointer, and not before.
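To see this sequence above the noise of context overhead, the experiment can be repeated with an allocation large enough to dominate it (a sketch, assuming at least 1 GiB of free GPU and host memory; whether the free-memory counter recovers at the prefetch-to-CPU step is up to the driver):

#include <iostream>

// Print the GPU free-memory counter with a label.
static void report(const char *label)
{
    size_t free_bytes, total_bytes;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << label << ": free = " << free_bytes << " bytes\n";
}

int main()
{
    const size_t N = 1024ull * 1024 * 1024;  // 1 GiB, dominates context overhead
    char *buf;

    cudaFree(0);  // force context creation so later readings isolate the data
    report("Baseline");

    cudaMallocManaged(&buf, N);
    cudaMemPrefetchAsync(buf, N, 0);  // populate GPU pages
    cudaDeviceSynchronize();
    report("After prefetch to GPU");  // expect roughly N less free

    cudaMemPrefetchAsync(buf, N, cudaCpuDeviceId);  // pages now resident on host
    cudaDeviceSynchronize();
    report("After prefetch to CPU");  // may or may not recover; unspecified

    cudaFree(buf);
    report("After cudaFree");         // this is where recovery is expected
    return 0;
}

The one transition you can rely on is the last one: cudaFree returns the allocation to the free pool.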