Is it possible to dedicate a portion of shared memory to each thread?

Question

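I am using CUDA 10.1 on Ubuntu 18.04, and I would like to know whether it is possible to dedicate a portion of shared memory to each thread. What I mean is this: I want each thread to have a relatively large private array that does not fit in registers, and putting that array in shared memory would introduce race conditions, because every thread in the block can access it. Using a portion of shared memory as a "register" would therefore be very interesting for me, since it would let me avoid race conditions. I know this will limit my occupancy by restricting the number of threads I can run on an SM, but in my case the trade-off is worth it. Thanks for your help!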

Answer 1

No, it is not possible to assign a shared memory allocation (static or dynamic) to a specific thread. Shared memory has block scope only.

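However, it is entirely possible to design an indexing scheme in which each thread in the block has exclusive use of a unique region within a block-scope shared memory allocation. For example: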

template<int nthreads, int words_per_thread>
__global__
void kernel(/* ... */)
{
    __shared__ int buffer[nthreads * words_per_thread];

    int* localbuff = &buffer[threadIdx.x * words_per_thread];

    // localbuff is now safely indexed from [0] to [words_per_thread-1]

}
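
To make the slicing scheme above concrete, here is a minimal self-contained sketch. The names `fill_private` and `run_check`, the sizes, and the output array are assumptions chosen for illustration; they are not part of the original answer. Each thread writes only into its own slice, copies it to global memory, and the host verifies the result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread owns words_per_thread consecutive ints
// of the block-scope shared buffer. Assumes a 1D block of exactly nthreads.
template<int nthreads, int words_per_thread>
__global__ void fill_private(int* out)
{
    __shared__ int buffer[nthreads * words_per_thread];

    // Slice used exclusively by this thread.
    int* localbuff = &buffer[threadIdx.x * words_per_thread];

    for (int i = 0; i < words_per_thread; ++i)
        localbuff[i] = threadIdx.x * 100 + i;

    // No __syncthreads() needed: a thread only reads what it wrote itself.
    for (int i = 0; i < words_per_thread; ++i)
        out[threadIdx.x * words_per_thread + i] = localbuff[i];
}

// Launches one block and checks that every slice holds its owner's values.
bool run_check()
{
    constexpr int nthreads = 32, words = 4;
    int h_out[nthreads * words] = {};
    int* d_out = nullptr;

    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;
    fill_private<nthreads, words><<<1, nthreads>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int t = 0; t < nthreads; ++t)
        for (int i = 0; i < words; ++i)
            if (h_out[t * words + i] != t * 100 + i) return false;
    return true;
}

int main()
{
    bool ok = run_check();
    printf("%s\n", ok ? "slices are private" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

The launch configuration has to match the template arguments: the block must be one-dimensional with exactly `nthreads` threads, otherwise threads would index past the end of `buffer`.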

Another possible approach is something like this:

#include <stdio.h>
template<typename T>
class sr_
{
  T *sp;
  public:
  __device__
  sr_(T *_sd) { sp = _sd;}
  __device__
  T &operator[](int idx){return sp[blockDim.x*idx+threadIdx.x];}
};
// for statically allocated shared memory
#define SREG(size,type,block_size) \
  __shared__ type SREG_sdata[size*block_size]; \
  typedef type SREG_type; \
  sr_<SREG_type> sreg(SREG_sdata);
// for dynamically allocated shared memory
#define DSREG(type) \
  extern __shared__ type SREG_sdata[]; \
  typedef type SREG_type; \
  sr_<SREG_type> sreg(SREG_sdata);

const int BS = 8;

__global__ void k2(){
  SREG(8,float,BS)
  sreg[0] = 1.0f;
  printf("%f\n", sreg[0]);
}

int main(){

  k2<<<1,BS>>>();
  cudaDeviceSynchronize();
}

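This has the benefit that one thread cannot index into another thread's space, and there won't be any bank conflicts either, because element `idx` of consecutive threads sits in consecutive shared memory locations. Note that this does not handle every use case. It would have to be modified, for example, if multiple kernels in the same module used different data types for shared memory.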

multithreading

cuda

gpgpu