cuRAND performs much worse than thrust when generating random numbers inside CUDA kernels

Question

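I am trying to generate "random" numbers from a uniform distribution inside a CUDA __global__ kernel using two different approaches. The first uses the cuRAND device API, and the second uses thrust. I have written a separate class for each approach.

Here is my cuRAND solution: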

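#include <curand_kernel.h>

template<typename T>
struct RNG1
{
    // Seed and sequence number are both set to the global thread id.
    __device__
    RNG1(unsigned int tid) {
        curand_init(tid, tid, 0, &state);
    }

    // curand_uniform returns a float, which is converted to T.
    __device__ T
    operator ()(void) {
        return curand_uniform(&state);
    }

    curandState state;
};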

Here is my thrust solution:

#include <thrust/random.h>

template<typename T>
struct RNG2
{
    // Seed the engine with the thread id, then discard tid values.
    __device__
    RNG2(unsigned int tid)
        : gen(tid)
        , dis(0, 1) { gen.discard(tid); }

    __device__ T
    operator ()(void) {
        return dis(gen);
    }

    thrust::default_random_engine gen;
    thrust::uniform_real_distribution<T> dis;
};

This is how I use them:

template<typename T> __global__ void
mykernel(/* args here */)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    RNG1<T> rng(tid);
    // or:
    // RNG2<T> rng(tid);

    T a_random_number = rng();

    // do stuff here
}

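Both of them work, but the cuRAND solution is much slower (more than 3 times slower). If I set the second parameter of curand_init (the sequence number) to 0, the performance matches that of the thrust solution, but the random numbers are "bad": I can see patterns and artefacts in the resulting distribution.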

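Here are my two questions:

1. Can someone explain to me why the cuRAND solution with a non-zero sequence number is slower?
2. How can thrust be as fast as cuRAND with a zero sequence number and still generate good random numbers?

While searching on Google, I noticed that most people use cuRAND, and very few use thrust to generate random numbers inside device code. Is there something I should be aware of? Am I misusing thrust?

Thank you.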

Answer 1

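The performance difference may arise because cuRAND and Thrust use different PRNG algorithms, with different performance profiles and memory requirements. Note that cuRAND supports five different PRNG algorithms, and your code does not indicate which one is in use.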

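Thrust's default_random_engine is currently minstd_rand, but its documentation notes that this "may change in a future version". (A comment written after I wrote mine also notes that it is minstd_rand.) minstd_rand is a simple linear congruential generator, which may be faster than whatever PRNG cuRAND is using.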
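To make "simple linear congruential generator" concrete, here is a minimal host-side C++ sketch of the minstd_rand recurrence. It is not Thrust's source code. It assumes the parameters the C++ standard specifies for std::minstd_rand (multiplier 48271, increment 0, modulus 2^31 - 1), which, as far as I know, Thrust's minstd_rand shares. The names minstd_step, to_unit_double and matches_std_minstd_rand are hypothetical helpers invented for this illustration.

```cpp
#include <cstdint>
#include <random>

// One step of the minstd_rand recurrence: x_{n+1} = (48271 * x_n) mod (2^31 - 1).
// The 64-bit intermediate keeps the product from overflowing.
inline std::uint32_t minstd_step(std::uint32_t x) {
    return static_cast<std::uint32_t>((48271ULL * x) % 2147483647ULL);
}

// Hypothetical helper: map a raw state in [1, 2^31 - 2] to a double in [0, 1).
// This is only an illustration, not how Thrust's uniform_real_distribution scales values.
inline double to_unit_double(std::uint32_t x) {
    return static_cast<double>(x - 1) / 2147483646.0;
}

// Check the hand-written step against std::minstd_rand for n draws.
// Assumes seed is in [1, 2^31 - 2], so the engine's state equals the seed.
inline bool matches_std_minstd_rand(std::uint32_t seed, int n) {
    std::minstd_rand reference(seed);
    std::uint32_t x = seed;
    for (int i = 0; i < n; ++i) {
        x = minstd_step(x);
        if (x != reference()) {
            return false;
        }
    }
    return true;
}
```

The whole generator state is a single 32-bit integer, and each draw costs one multiply and one modulo. By contrast, the curandState type used in the question is, to my knowledge, a typedef for cuRAND's XORWOW state, which is considerably larger. That is consistent with the suggestion above that the two libraries have different performance and memory profiles, although this sketch does not measure either of them.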

This is a comment that was converted into an answer and edited.
