What is the correct way to support `__shfl()` and `__shfl_sync()` instructions?

As I understand it, CUDA 10.1 removed the `shfl` instructions that lack the `.sync` qualifier:

PTX ISA version 6.4 removes the following features:

Support for shfl and vote instructions without the .sync qualifier has been removed for .target sm_70 and higher. This support was deprecated since PTX ISA version 6.0 as documented in PTX ISA version 6.2.

What is the correct way to support shfl across both past and future CUDA versions?

My current approach (shared below) fails with the following error under CUDA 10.1:

ptxas ... line 466727; error   : Instruction 'shfl' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4

template <typename T>
__device__ static __forceinline__
T _shfl_up(T var, unsigned int delta, int width=WARPSIZE, unsigned mask=MEMBERMASK)
{
#if (__CUDACC_VER_MAJOR__ >= 9)
  var = __shfl_up_sync(mask, var, delta, width);
#else
  var = __shfl_up(var, delta, width);
#endif
  return var;
}

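I would also add that one of my project's dependencies is CUB, and I believe it uses the same approach to switch between the `_sync()` variants and the older shfl instructions. I am not sure what I am doing wrong.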

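It turned out that I had been doing it correctly. Another dependency did not support the `_sync` variants, so I opened a pull request for it: https://github.com/moderngpu/moderngpu/pull/32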

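template <typename T>
__device__ static __forceinline__
T _shfl_up(T var, unsigned int delta, int width=WARPSIZE, unsigned mask=MEMBERMASK)
{
// Warp shuffles need compute capability 3.0 or higher. This guard is also
// false during the host compilation pass, where __CUDA_ARCH__ is undefined.
#if (__CUDA_ARCH__ >= 300)
// CUDA 9 introduced the *_sync variants; the variants without .sync are
// rejected for sm_70 and higher from PTX ISA version 6.4.
#if (__CUDACC_VER_MAJOR__ >= 9)
  var = __shfl_up_sync(mask, var, delta, width);
#else
  var = __shfl_up(var, delta, width);
#endif
#endif
  return var;
}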
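
For anyone who wants to try the wrapper in isolation, here is a minimal sketch of how it might be used. The values of `WARPSIZE` and `MEMBERMASK` are my assumptions (the post never shows their definitions), and `warp_inclusive_scan` and `run_warp_scan_demo` are hypothetical names made up for this illustration. Error checking on the CUDA runtime calls is omitted to keep it short.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed definitions: the post uses these macros without showing them.
#define WARPSIZE 32
#define MEMBERMASK 0xffffffffu

template <typename T>
__device__ static __forceinline__
T _shfl_up(T var, unsigned int delta, int width = WARPSIZE, unsigned mask = MEMBERMASK)
{
#if (__CUDA_ARCH__ >= 300)
#if (__CUDACC_VER_MAJOR__ >= 9)
  var = __shfl_up_sync(mask, var, delta, width);
#else
  var = __shfl_up(var, delta, width);
#endif
#endif
  return var;
}

// Hypothetical kernel: inclusive prefix sum over one full warp.
__global__ void warp_inclusive_scan(int *data)
{
  int lane = threadIdx.x % WARPSIZE;
  int value = data[threadIdx.x];
  for (unsigned int delta = 1; delta < WARPSIZE; delta *= 2) {
    // Every lane executes the shuffle, so the full member mask stays valid.
    int neighbor = _shfl_up(value, delta);
    // Lanes below delta have no source lane and keep their own value.
    if (lane >= (int)delta) value += neighbor;
  }
  data[threadIdx.x] = value;
}

// Hypothetical host helper: scans WARPSIZE ones on the device and
// returns the result to the caller.
std::vector<int> run_warp_scan_demo()
{
  std::vector<int> host(WARPSIZE, 1);
  int *device = nullptr;
  cudaMalloc(&device, WARPSIZE * sizeof(int));
  cudaMemcpy(device, host.data(), WARPSIZE * sizeof(int), cudaMemcpyHostToDevice);
  warp_inclusive_scan<<<1, WARPSIZE>>>(device);
  cudaMemcpy(host.data(), device, WARPSIZE * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(device);
  return host;
}
```

With an input of 32 ones, lane `i` should end up holding `i + 1`. If a build like this still triggers the ptxas error, the non-sync `shfl` is probably coming from some other header in the translation unit, which is what happened in my case.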