
What are the requirements for using `shfl` operations on AMD GPU using HIP C++?

AMD HIP C++ is very similar to CUDA C++. AMD has also created Hipify to convert CUDA C++ into HIP C++ (portable C++ code), which can run on both nVidia GPUs and AMD GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP

Requirement for nVidia:

please make sure you have a device with compute capability 3.0 or higher in order to use warp shfl operations, and add the `-gencode arch=compute_30,code=sm_30` nvcc flag in the Makefile when using this application.

In addition, HIP defines portable mechanisms to query architectural features, and supports a larger 64-bit wavesize which expands the return type for cross-lane functions like ballot and shuffle from 32-bit ints to 64-bit ints.

But which AMD GPUs support the shfl function — or do all AMD GPUs support shfl, given that on AMD GPUs it is implemented using local memory, with no hardware register-to-register instruction?

nVidia GPUs require compute capability (CUDA CC) 3.0 or higher, but what are the requirements for using shfl operations on AMD GPUs with HIP C++?

  1. GCN3 GPUs have new instructions such as `ds_bpermute` and `ds_permute`, which can provide the functionality of `__shfl()` and more.

  2. These `ds_bpermute`/`ds_permute` instructions use only the routing of local memory (LDS, 8.6 TB/s) but do not actually use local memory itself, which accelerates data exchange between threads: 8.6 TB/s < speed < 51.6 TB/s: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

They use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location.

  3. There are also Data Parallel Primitives (DPP) — especially powerful where you can use them, because an operation can read the registers of neighboring work-items directly. I.e., DPP can access neighboring threads (work-items) at full speed, ~51.6 TB/s:

http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

Now, most of the vector instructions can do cross-lane reading at full throughput.

For example, the `wave_shr` instruction (wavefront shift right) for a scan algorithm:

More about GCN3: https://github.com/olvaffe/gpu-docs/raw/master/amd-open-gpu-docs/AMD_GCN3_Instruction_Set_Architecture.pdf

New Instructions

  • “SDWA” – Sub Dword Addressing allows access to bytes and words of VGPRs in VALU instructions.
  • “DPP” – Data Parallel Processing allows VALU instructions to access data from neighboring lanes.
  • DS_PERMUTE_RTN_B32, DS_BPERMUTE_RTN_B32.

...

DS_PERMUTE_B32 Forward permute. Does not write any LDS memory.