CUDA memory copy is force-aligned

Question

I wrote a test to illustrate my problem. The code tries to copy 16 bytes to memory that is not 4-byte aligned, but dest is silently changed:

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__
void Copy128(char *dest,const char *src)
{
    ((int*)dest)[0]=((int*)src)[0];
    ((int*)dest)[1]=((int*)src)[1];
    ((int*)dest)[2]=((int*)src)[2];
    ((int*)dest)[3]=((int*)src)[3];
}
__global__
void fill_src(char *src)
{
    for(int i=0; i<16; i++)
        src[i] = i+1; // starts from 1
}

int main()
{
    char* dest;
    cudaMalloc(&dest, 17);

    char* src;
    cudaMalloc(&src, 16);

    fill_src<<<1, 1>>>((char*)src); // fill some value for debugging

    // copy to dest+1 which is not aligned to 4
    Copy128<<<1, 1>>>(dest + 1, src);

    getchar();
}

Debugging the code in VS2013 (see screenshot), the destination address is 0x40A80001, but the data is actually copied to 0x40A80000.

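The problem is that if dest is not aligned to 4 bytes, it gets modified automatically. It happens silently, and it took me several hours to track down this bug.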

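I know it is best to use well-aligned memory, but I am writing a RAR decompression program that decompresses some bytes and then concatenates some bytes, so the data cannot always be aligned.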

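I expect to use uint64 in functions like Copy256. Is this the normal behavior of forced memory alignment? Is there any compile flag to turn this feature off? Or should I copy the bytes one at a time?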

Environment: CUDA 6.5, Win7 32-bit, VS2013

Answer 1

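- Is this the normal behavior of forced memory alignment? Yes. Quoting the CUDA C Programming Guide: "Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes".

- Is there any compile flag to turn this feature off? I would guess not; this is probably tied to the hardware.

- Or should I copy the bytes one at a time? If you are dealing with (very) unaligned memory, that is the only way to avoid unaligned stores (as stated above). However, you should try to detect (at compile time or at run time) when your memory operations are aligned, and then use the widest load/store available (int4 results in an ldg instruction, which gives you far better bandwidth).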
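
As a rough illustration of the last point, here is a minimal sketch of a copy routine that checks alignment at run time. The names copy_bytes and CopyBytes are hypothetical and do not come from the original post, and the 4-byte word size is only one possible choice (the same idea extends to int2 or int4). The macro guard is an assumption made so the helper also builds with a plain C++ compiler for host-side testing; with nvcc the file compiles as ordinary CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Let the same source build with nvcc (CUDA) or a plain C++ compiler.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: copies n bytes from src to dest (non-overlapping).
// It uses 4-byte loads/stores only at addresses that are 4-byte aligned
// and falls back to single-byte accesses everywhere else.
__host__ __device__ inline void copy_bytes(char *dest, const char *src, size_t n)
{
    assert(n == 0 || (dest != nullptr && src != nullptr));

    const uintptr_t d = reinterpret_cast<uintptr_t>(dest);
    const uintptr_t s = reinterpret_cast<uintptr_t>(src);
    size_t i = 0;

    // Word copies are only possible if both pointers have the same offset
    // within a 4-byte word; otherwise one side is always misaligned.
    if (((d ^ s) & 3u) == 0) {
        // Head: copy single bytes until dest (and therefore src) is aligned.
        while (i < n && ((d + i) & 3u) != 0) {
            dest[i] = src[i];
            ++i;
        }
        // Body: naturally aligned 4-byte accesses.
        for (; i + 4 <= n; i += 4) {
            *reinterpret_cast<uint32_t *>(dest + i) =
                *reinterpret_cast<const uint32_t *>(src + i);
        }
    }

    // Tail, or the whole copy when the offsets differ: byte accesses
    // have no alignment requirement.
    for (; i < n; ++i)
        dest[i] = src[i];
}

#ifdef __CUDACC__
// Single-thread kernel wrapper, mirroring Copy128 from the question.
__global__ void CopyBytes(char *dest, const char *src, size_t n)
{
    copy_bytes(dest, src, n);
}
#endif
```

Under these assumptions, the launch from the question would become CopyBytes<<<1, 1>>>(dest + 1, src, 16). Because dest + 1 and src have different offsets within a word, that call takes the byte-by-byte path and writes to exactly the requested address; the wide path only kicks in when both pointers can be brought onto the same word boundary.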
