What are the requirements for using `shfl` operations on AMD GPU using HIP C++?
There is AMD HIP C++, which is very similar to CUDA C++. AMD also created Hipify to convert CUDA C++ into HIP C++ (portable C++ code) that can execute on both nVidia GPUs and AMD GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP
- There is a requirement for using `shfl` operations on nVidia GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/tree/master/samples/2_Cookbook/4_shfl#requirement-for-nvidia

> requirement for nvidia
>
> please make sure you have a 3.0 or higher compute capable device in order to use warp shfl operations and add -gencode arch=compute=30, code=sm_30 nvcc flag in the Makefile while using this application.
- Also note that HIP supports `shfl` for the larger 64-lane wavefront size (warp size) on AMD (see the sketch below): https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/docs/markdown/hip_faq.md#why-use-hip-rather-than-supporting-cuda-directly

> In addition, HIP defines portable mechanisms to query architectural features, and supports a larger 64-bit wavesize which expands the return type for cross-lane functions like ballot and shuffle from 32-bit ints to 64-bit ints.
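For example, a minimal sketch (my own, not from the HIP samples) of how the 64-bit return type shows up in code: `__ballot` returns a 64-bit lane mask because all 64 lanes of an AMD wavefront vote:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Each lane votes; on AMD the wavefront has 64 lanes, so the mask is 64-bit.
__global__ void ballot_demo(unsigned long long* out) {
    unsigned long long mask = __ballot(threadIdx.x % 2 == 0); // 64-bit on AMD
    if (threadIdx.x == 0) *out = mask;
}

int main() {
    unsigned long long *d_out, h_out = 0;
    hipMalloc(&d_out, sizeof(h_out));
    hipLaunchKernelGGL(ballot_demo, dim3(1), dim3(64), 0, 0, d_out); // one wavefront
    hipMemcpy(&h_out, d_out, sizeof(h_out), hipMemcpyDeviceToHost);
    printf("ballot mask: 0x%llx\n", h_out); // expect 0x5555555555555555 on AMD
    hipFree(d_out);
    return 0;
}
```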
But which AMD GPUs support the `shfl` feature? Or do all AMD GPUs support `shfl`, because on AMD GPUs it is implemented by using Local memory, with no hardware register-to-register instruction?

nVidia GPUs require compute capability (CUDA CC) 3.0 or higher, but what are the requirements for using `shfl` operations on an AMD GPU with HIP C++?
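For concreteness, here is a minimal portable sketch of the kind of `shfl` usage I mean (my own example, using the documented `__shfl_down` intrinsic and the `warpSize` device property):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Sum across one warp/wavefront with __shfl_down.
// warpSize is 32 on nVidia (needs CUDA CC >= 3.0) and 64 on AMD GCN.
__global__ void warp_sum(const int* in, int* out) {
    int v = in[threadIdx.x];
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down(v, offset); // pull the value from the lane `offset` higher
    if (threadIdx.x == 0) *out = v;  // lane 0 holds the total
}

int main() {
    hipDeviceProp_t props;
    hipGetDeviceProperties(&props, 0);
    int n = props.warpSize; // portable query: 32 (nVidia) or 64 (AMD)

    std::vector<int> h_in(n, 1);
    int *d_in, *d_out, h_out = 0;
    hipMalloc(&d_in, n * sizeof(int));
    hipMalloc(&d_out, sizeof(int));
    hipMemcpy(d_in, h_in.data(), n * sizeof(int), hipMemcpyHostToDevice);

    hipLaunchKernelGGL(warp_sum, dim3(1), dim3(n), 0, 0, d_in, d_out);
    hipMemcpy(&h_out, d_out, sizeof(int), hipMemcpyDeviceToHost);
    printf("warp sum = %d (expected %d)\n", h_out, n);

    hipFree(d_in);
    hipFree(d_out);
    return 0;
}
```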
Yes, GCN3 GPUs have new instructions such as `ds_bpermute` and `ds_permute` which can provide the functionality of `__shfl()` and even more.
These `ds_bpermute` and `ds_permute` instructions use only the routing of Local memory (LDS, 8.6 TB/s) but do not actually write to Local memory, which allows them to accelerate data exchange between threads: 8.6 TB/s < speed < 51.6 TB/s: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

> They use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location.
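As an illustration, here is a hypothetical sketch (mine, not from the GPUOpen post) of the case that maps to the backward permute: a `__shfl` whose source lane is computed at runtime, which the compiler can lower to `ds_bpermute_b32` on GCN3:

```cpp
#include <hip/hip_runtime.h>

// Reverse the lanes of one wavefront. The source lane is only known at
// runtime, so on GCN3 the compiler can lower this __shfl to ds_bpermute_b32:
// the LDS routing hardware moves the data, but no LDS location is written.
// Launch with a single wavefront, e.g.:
//   hipLaunchKernelGGL(reverse_wavefront, dim3(1), dim3(64), 0, 0, d_in, d_out);
__global__ void reverse_wavefront(const int* in, int* out) {
    int lane = threadIdx.x % warpSize;
    int v = in[threadIdx.x];
    int src = (warpSize - 1) - lane;   // mirrored lane index
    out[threadIdx.x] = __shfl(v, src); // backward permute: "read from lane src"
}
```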
- There are also Data Parallel Primitives (DPP), which are especially powerful where you can use them, because an operation can read the registers of neighboring work-items directly. I.e., DPP can access neighboring threads (work-items) at full speed, ~51.6 TB/s: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

> now, most of the vector instructions can do cross-lane reading at full throughput.
For example, there is the `wave_shr` instruction (wavefront shift right) for Scan algorithms.
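As a HIP-level sketch of the same idea (my own code, under the assumption that `__shfl_up` with small constant deltas can be implemented with DPP shifts such as `wave_shr` on GCN3), an inclusive plus-scan across a wavefront looks like this:

```cpp
#include <hip/hip_runtime.h>

// Inclusive plus-scan (Hillis-Steele) across one wavefront using __shfl_up.
// Each step pulls the value from the lane `d` positions to the left; lanes
// with lane < d keep their value unchanged.
__device__ int wave_inclusive_scan(int v) {
    int lane = threadIdx.x % warpSize;
    for (int d = 1; d < warpSize; d *= 2) {
        int up = __shfl_up(v, d);
        if (lane >= d) v += up;
    }
    return v;
}
```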
More about GCN3: https://github.com/olvaffe/gpu-docs/raw/master/amd-open-gpu-docs/AMD_GCN3_Instruction_Set_Architecture.pdf
> New Instructions
>
> - “SDWA” – Sub Dword Addressing allows access to bytes and words of VGPRs in VALU instructions.
> - “DPP” – Data Parallel Processing allows VALU instructions to access data from neighboring lanes.
> - DS_PERMUTE_RTN_B32, DS_BPERMUTE_RTN_B32.
>
> ...
>
> DS_PERMUTE_B32 Forward permute. Does not write any LDS memory.