What are the requirements for using `shfl` operations on AMD GPU using HIP C++?
There is AMD HIP C++, which is very similar to CUDA C++. AMD also created Hipify to convert CUDA C++ into HIP C++ (portable C++ code) that can execute on both nVidia GPUs and AMD GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP
- There is a requirement for using `shfl` operations on nVidia GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/tree/master/samples/2_Cookbook/4_shfl#requirement-for-nvidia

> requirement for nvidia
>
> please make sure you have a 3.0 or higher compute capable device in order to use warp shfl operations and add -gencode arch=compute=30, code=sm_30 nvcc flag in the Makefile while using this application.
- Also note that HIP supports `shfl` for the larger 64-lane wavefront size (warp size) on AMD (see the sketch below): https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/docs/markdown/hip_faq.md#why-use-hip-rather-than-supporting-cuda-directly

> In addition, HIP defines portable mechanisms to query architectural features, and supports a larger 64-bit wavesize which expands the return type for cross-lane functions like ballot and shuffle from 32-bit ints to 64-bit ints.
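For example, a minimal sketch (my own, not from the HIP samples) of how the 64-bit return type shows up in code: `__ballot` returns a 64-bit lane mask because all 64 lanes of an AMD wavefront vote:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Each lane votes; on AMD the wavefront has 64 lanes, so the mask is 64-bit.
__global__ void ballot_demo(unsigned long long* out) {
    unsigned long long mask = __ballot(threadIdx.x % 2 == 0); // 64-bit on AMD
    if (threadIdx.x == 0) *out = mask;
}

int main() {
    unsigned long long *d_out, h_out = 0;
    hipMalloc(&d_out, sizeof(h_out));
    hipLaunchKernelGGL(ballot_demo, dim3(1), dim3(64), 0, 0, d_out); // one wavefront
    hipMemcpy(&h_out, d_out, sizeof(h_out), hipMemcpyDeviceToHost);
    printf("ballot mask: 0x%llx\n", h_out); // expect 0x5555555555555555 on AMD
    hipFree(d_out);
    return 0;
}
```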
But which AMD GPUs support the `shfl` feature? Or do all AMD GPUs support `shfl`, because on AMD GPUs it is implemented by using Local memory, with no hardware register-to-register instruction?

nVidia GPUs require compute capability (CUDA CC) 3.0 or higher, but what are the requirements for using `shfl` operations on an AMD GPU with HIP C++?
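For concreteness, here is a minimal portable sketch of the kind of `shfl` usage I mean (my own example, using the documented `__shfl_down` intrinsic and the `warpSize` device property):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Sum across one warp/wavefront with __shfl_down.
// warpSize is 32 on nVidia (needs CUDA CC >= 3.0) and 64 on AMD GCN.
__global__ void warp_sum(const int* in, int* out) {
    int v = in[threadIdx.x];
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down(v, offset); // pull the value from the lane `offset` higher
    if (threadIdx.x == 0) *out = v;  // lane 0 holds the total
}

int main() {
    hipDeviceProp_t props;
    hipGetDeviceProperties(&props, 0);
    int n = props.warpSize; // portable query: 32 (nVidia) or 64 (AMD)

    std::vector<int> h_in(n, 1);
    int *d_in, *d_out, h_out = 0;
    hipMalloc(&d_in, n * sizeof(int));
    hipMalloc(&d_out, sizeof(int));
    hipMemcpy(d_in, h_in.data(), n * sizeof(int), hipMemcpyHostToDevice);

    hipLaunchKernelGGL(warp_sum, dim3(1), dim3(n), 0, 0, d_in, d_out);
    hipMemcpy(&h_out, d_out, sizeof(int), hipMemcpyDeviceToHost);
    printf("warp sum = %d (expected %d)\n", h_out, n);

    hipFree(d_in);
    hipFree(d_out);
    return 0;
}
```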
Yes, GCN3 GPUs have new instructions such as `ds_bpermute` and `ds_permute` which can provide the functionality of `__shfl()` and even more.
These `ds_bpermute` and `ds_permute` instructions use only the routing of Local memory (LDS, 8.6 TB/s) but do not actually write to Local memory, which allows them to accelerate data exchange between threads: 8.6 TB/s < speed < 51.6 TB/s: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

> They use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location.
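As an illustration, here is a hypothetical sketch (mine, not from the GPUOpen post) of the case that maps to the backward permute: a `__shfl` whose source lane is computed at runtime, which the compiler can lower to `ds_bpermute_b32` on GCN3:

```cpp
#include <hip/hip_runtime.h>

// Reverse the lanes of one wavefront. The source lane is only known at
// runtime, so on GCN3 the compiler can lower this __shfl to ds_bpermute_b32:
// the LDS routing hardware moves the data, but no LDS location is written.
// Launch with a single wavefront, e.g.:
//   hipLaunchKernelGGL(reverse_wavefront, dim3(1), dim3(64), 0, 0, d_in, d_out);
__global__ void reverse_wavefront(const int* in, int* out) {
    int lane = threadIdx.x % warpSize;
    int v = in[threadIdx.x];
    int src = (warpSize - 1) - lane;   // mirrored lane index
    out[threadIdx.x] = __shfl(v, src); // backward permute: "read from lane src"
}
```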
- There are also Data Parallel Primitives (DPP), which are especially powerful where you can use them, because an operation can read the registers of neighboring work-items directly. I.e., DPP can access neighboring threads (work-items) at full speed, ~51.6 TB/s: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

> now, most of the vector instructions can do cross-lane reading at full throughput.
For example, there is the `wave_shr` instruction (wavefront shift right) for Scan algorithms.
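As a HIP-level sketch of the same idea (my own code, under the assumption that `__shfl_up` with small constant deltas can be implemented with DPP shifts such as `wave_shr` on GCN3), an inclusive plus-scan across a wavefront looks like this:

```cpp
#include <hip/hip_runtime.h>

// Inclusive plus-scan (Hillis-Steele) across one wavefront using __shfl_up.
// Each step pulls the value from the lane `d` positions to the left; lanes
// with lane < d keep their value unchanged.
__device__ int wave_inclusive_scan(int v) {
    int lane = threadIdx.x % warpSize;
    for (int d = 1; d < warpSize; d *= 2) {
        int up = __shfl_up(v, d);
        if (lane >= d) v += up;
    }
    return v;
}
```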
More about GCN3: https://github.com/olvaffe/gpu-docs/raw/master/amd-open-gpu-docs/AMD_GCN3_Instruction_Set_Architecture.pdf
> New Instructions
>
> - “SDWA” – Sub Dword Addressing allows access to bytes and words of VGPRs in VALU instructions.
> - “DPP” – Data Parallel Processing allows VALU instructions to access data from neighboring lanes.
> - DS_PERMUTE_RTN_B32, DS_BPERMUTE_RTN_B32.
>
> ...
>
> DS_PERMUTE_B32 Forward permute. Does not write any LDS memory.