为什么即使使用 -cudart static 进行编译，库用户仍然需要链接到 cuda 运行时

Question

我有一些简单的 cuda 代码，我正在使用 nvcc 编译成静态库，还有一些我正在使用 g++ 和 link 编译的用户代码编译好的静态库。尝试 link 时，即使我在 nvcc 编译命令行中使用 -cudart static 选项，我也会收到 cudaMalloc 之类的 linker 错误。

这是我的代码：

//kern.hpp
#include <cstddef>

class Kern
{
    private:
        float* d_data;
        size_t size;

    public:
        Kern(size_t s);
        ~Kern();
        void set_data(float *d); 
};

//kern.cu
#include <iostream>
#include <kern.hpp>

__global__ void kern(float* data, size_t size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx < size) 
    {
        data[idx] = 0;
    }
} 

Kern::Kern(size_t s) : size(s)
{
    cudaMalloc((void**)&d_data, size*sizeof(float));
}

Kern::~Kern()
{
    cudaFree(d_data);
}

void Kern::set_data(float* d)
{
    size_t grid_size = size;
    std::cout << "Starting kernel with grid size " << grid_size << " and block size " << 1 <<
        std::endl;
    kern<<<grid_size, 1>>>(d_data, size);
    cudaError_t err = cudaGetLastError();
    if(err != cudaSuccess)
        std::cout << "ERROR: " << cudaGetErrorString(err) << std::endl;
    cudaDeviceSynchronize();
    cudaMemcpy((void*)d, (void*)d_data, size*sizeof(float), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
}

//main.cpp
#include <iostream>
#include <kern.hpp>

int main(int argc, char** argv)
{
    std::cout << "starting" << std::endl;
    Kern k(256);
    float arr[256];
    k.set_data(arr);
    bool ok = true;
    for(int i = 0; i < 256; ++i) ok &= arr[i] == 0;
    std::cout << (ok ? "done" : "wrong") << std::endl;
}

我正在用 nvcc 编译内核，如下所示：

nvcc -I ./ -lib --compiler-options '-fPIC' -o libkern.a kern.cu -cudart static

然后主g++如下：

g++ -o main main.cpp -I ./ -L. -L/opt/cuda/lib64 -lkern

产生错误：

/usr/bin/ld: ./libkern.a(tmpxft_00001d30_00000000-8_kern.o): in function `Kern::Kern(unsigned long)':
tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x4d): undefined reference to `cudaMalloc'
/usr/bin/ld: ./libkern.a(tmpxft_00001d30_00000000-8_kern.o): in function `Kern::~Kern()':
tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x6b): undefined reference to `cudaFree'
/usr/bin/ld: ./libkern.a(tmpxft_00001d30_00000000-8_kern.o): in function `Kern::set_data(float*)':
tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x152): undefined reference to `__cudaPushCallConfiguration'
/usr/bin/ld: tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x175): undefined reference to `cudaGetLastError'
/usr/bin/ld: tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x1a1): undefined reference to `cudaGetErrorString'
/usr/bin/ld: tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x1c6): undefined reference to `cudaDeviceSynchronize'
/usr/bin/ld: tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x1ee): undefined reference to `cudaMemcpy'
/usr/bin/ld: tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x1f3): undefined reference to `cudaDeviceSynchronize'
/usr/bin/ld: ./libkern.a(tmpxft_00001d30_00000000-8_kern.o): in function `__cudaUnregisterBinaryUtil()':
tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x24e): undefined reference to `__cudaUnregisterFatBinary'
/usr/bin/ld: ./libkern.a(tmpxft_00001d30_00000000-8_kern.o): in function `__nv_init_managed_rt_with_module(void**)':
tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x269): undefined reference to `__cudaInitModule'
/usr/bin/ld: ./libkern.a(tmpxft_00001d30_00000000-8_kern.o): in function `__device_stub__Z4kernPfm(float*, unsigned long)':
tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x305): undefined reference to `__cudaPopCallConfiguration'
/usr/bin/ld: ./libkern.a(tmpxft_00001d30_00000000-8_kern.o): in function `__nv_cudaEntityRegisterCallback(void**)':
tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x430): undefined reference to `__cudaRegisterFunction'
/usr/bin/ld: ./libkern.a(tmpxft_00001d30_00000000-8_kern.o): in function `__sti____cudaRegisterAll()':
tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x44b): undefined reference to `__cudaRegisterFatBinary'
/usr/bin/ld: tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x47c): undefined reference to `__cudaRegisterFatBinaryEnd'
/usr/bin/ld: ./libkern.a(tmpxft_00001d30_00000000-8_kern.o): in function `cudaError cudaLaunchKernel<char>(char const*, dim3, dim3, void**, unsigned long, CUstream_st*)':
tmpxft_00001d30_00000000-5_kern.cudafe1.cpp:(.text+0x4d9): undefined reference to `cudaLaunchKernel'
collect2: error: ld returned 1 exit status

但是如果我执行以下操作：

g++ -o main main.cpp -I ./ -L. -L/opt/cuda/lib64 -lkern -lcudart

一切正常。我的问题是，因为我在 nvcc 编译行中有一个 -cudart static，所以 libkern.a 不应该已经解析了 cuda 运行时的符号吗？为什么 -lcudart 在 g++ 行中仍然是必需的？

此外，如果我将 libkern.a 更改为共享对象，则 linking 到 g++ 行中的 cuda 运行时不会起作用。即以下作品：

nvcc -I ./ -shared --compiler-options '-fPIC' -o libkern.so kern.cu -cudart static
g++ -o main main.cpp -I ./ -L. -L/opt/cuda/lib64 -lkern

为什么静态库版本失败，而共享对象版本有效？

请注意，在 nvcc 行中将 -cudart static 替换为 -lcudart_static 后，我尝试了上述方案，并且进行该替换后行为没有任何变化。这是意料之中的，因为这两个选项基本上做同样的事情对吗？

我在 linux.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

g++ --version
g++ (GCC) 10.1.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

非常感谢任何帮助and/or 澄清。

Answer 1

如果你研究 the nvcc documentation，很明显 -lib 选项创建一个静态库（并指定没有 linking），而 -shared 选项创建共享库，并指定 linking。例如摘录：

4.2.2.1. --link (-link) Specify the default behavior: compile and link all input files.

4.2.2.2. --lib (-lib) Compile all input files into object files, if necessary, and add the results to the specified library output file.

4.2.3.11. --shared (-shared) Generate a shared library during linking. Use option --linker-options when other linker options are required for more control.

我相信这或多或少与典型的 gcc/g++ 用法一致。如果您在“g++ create static library”上进行 google 搜索，您将得到任意数量的 references，这表明您基本上应该这样做：

g++ -c my_source_file.cpp ...
ar ...

换句话说，指定了源代码到对象的编译，但没有指定linking。举一个例子，cudaMalloc 是 CUDA 运行时库的一部分，与它的连接将在 link 阶段完成。

nvcc 是一个相当复杂的引擎盖下的动物，但我们应该记住，对于某些功能，它主要使用已安装的主机工具链。这包括编译主机代码，还包括最后的 link 阶段。

与此相结合，我相信您想在这里做的是“部分”linking 或增量 linking。在最后的 link 阶段之前执行一些最后的 link 阶段。

GNU linker（同样，nvcc 将在 linux 上默认使用什么）supports that，所以如果我们撇开对编译可重定位设备代码的任何关注，应该可以按如下方式执行您想要的操作：

$ nvcc  -Xcompiler '-fPIC' -I.  -c kern.cu
$ ld -o kern.ro -r kern.o -L/usr/local/cuda/lib64 -lcudart_static -lculibos
$ ar rs libkern.a kern.ro
ar: creating libkern.a
$ g++ -o main main.cpp  -I ./ -L.  -lkern -lpthread -lrt -ldl
$ cuda-memcheck ./main
========= CUDA-MEMCHECK
starting
Starting kernel with grid size 256 and block size 1
done
========= ERROR SUMMARY: 0 errors
$

备注：

-lpthread -lrt -ldl 是 cudart/culibos 的标准库依赖项，因此需要在最后的 link 阶段提供这些，但它们不依赖于任何CUDA 工具包项目。如果您希望这些依赖项也从增量 linked 对象中删除，我将其视为一个单独的问题，与 CUDA 无关。
归档步骤（库的创建）对于这个简单的案例来说不是必需的。我们可以将增量 linked (-r) 对象 kern.ro 直接传递到最后的 compilation/link 步骤。
请注意，您的 CUDA 安装显然位于不同的位置，因此可能需要更改上述某些库路径 (-L)。

为什么即使使用 -cudart static 进行编译，库用户仍然需要链接到 cuda 运行时

Why is linking to cuda runtime still necessary for library user even when compiling with -cudart static

c++

cuda

static-linking

dynamic-linking