What is the correct way to support `__shfl()` and `__shfl_sync()` instructions?
As I understand it, CUDA 10.1 removed the non-`.sync` variants of the `shfl` instructions:

PTX ISA version 6.4 removes the following features:
Support for `shfl` and `vote` instructions without the `.sync` qualifier has been removed for `.target sm_70` and higher. This support was deprecated since PTX ISA version 6.0, as documented in PTX ISA version 6.2.
What is the correct way to support `shfl` across past and future CUDA versions? My current approach (shared below) results in an error with CUDA 10.1:
ptxas ... line 466727; error : Instruction 'shfl' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4
```cuda
template <typename T>
__device__ static __forceinline__
T _shfl_up(T var, unsigned int delta, int width=WARPSIZE, unsigned mask=MEMBERMASK)
{
#if (__CUDACC_VER_MAJOR__ >= 9)
    // CUDA 9+: use the warp-synchronous variant with an explicit member mask
    var = __shfl_up_sync(mask, var, delta, width);
#else
    // Older toolkits only provide the non-sync variant
    var = __shfl_up(var, delta, width);
#endif
    return var;
}
```
Also, I'd like to add that one of my project's dependencies is CUB, and I believe they use the same approach to dispatch between the `_sync()` intrinsics and the older `shfl` intrinsics. I'm not sure what I'm doing wrong.

Update: it turns out I was doing it right. Another dependency did not support the `sync` variants, and I created a pull request for it: https://github.com/moderngpu/moderngpu/pull/32
```cuda
template <typename T>
__device__ static __forceinline__
T _shfl_up(T var, unsigned int delta, int width=WARPSIZE, unsigned mask=MEMBERMASK)
{
#if (__CUDA_ARCH__ >= 300)  // warp shuffles require compute capability 3.0+
#if (__CUDACC_VER_MAJOR__ >= 9)
    // CUDA 9+: warp-synchronous variant with an explicit member mask
    var = __shfl_up_sync(mask, var, delta, width);
#else
    // Pre-CUDA-9 toolkits only provide the non-sync variant
    var = __shfl_up(var, delta, width);
#endif
#endif
    return var;
}
```