What is the correct way to support `__shfl()` and `__shfl_sync()` instructions?
As I understand it, CUDA 10.1 removed the non-`.sync` variants of the `shfl` instructions:

PTX ISA version 6.4 removes the following features:
Support for `shfl` and `vote` instructions without the `.sync` qualifier has been removed for `.target sm_70` and higher. This support was deprecated since PTX ISA version 6.0, as documented in PTX ISA version 6.2.
What is the correct way to support `shfl` across past and future CUDA versions? My current approach (shared below) results in an error with CUDA 10.1:
ptxas ... line 466727; error : Instruction 'shfl' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4
```cuda
template <typename T>
__device__ static __forceinline__
T _shfl_up(T var, unsigned int delta, int width=WARPSIZE, unsigned mask=MEMBERMASK)
{
#if (__CUDACC_VER_MAJOR__ >= 9)
    // CUDA 9+: use the warp-synchronous variant with an explicit member mask
    var = __shfl_up_sync(mask, var, delta, width);
#else
    // Older toolkits only provide the non-sync variant
    var = __shfl_up(var, delta, width);
#endif
    return var;
}
```
Also, I'd like to add that one of my project's dependencies is CUB, and I believe they use the same approach to dispatch between the `_sync()` intrinsics and the older `shfl` intrinsics. I'm not sure what I'm doing wrong.

Update: it turns out I was doing it right. Another dependency did not support the `sync` variants, and I created a pull request for it: https://github.com/moderngpu/moderngpu/pull/32
```cuda
template <typename T>
__device__ static __forceinline__
T _shfl_up(T var, unsigned int delta, int width=WARPSIZE, unsigned mask=MEMBERMASK)
{
#if (__CUDA_ARCH__ >= 300)  // warp shuffles require compute capability 3.0+
#if (__CUDACC_VER_MAJOR__ >= 9)
    // CUDA 9+: warp-synchronous variant with an explicit member mask
    var = __shfl_up_sync(mask, var, delta, width);
#else
    // Pre-CUDA-9 toolkits only provide the non-sync variant
    var = __shfl_up(var, delta, width);
#endif
#endif
    return var;
}
```