Does the official OpenCL 2.2 standard support wavefronts?
As is well known, AMD's OpenCL implementation supports wavefronts (August 2015): http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf
The AMD Radeon HD 7770 GPU, for example, supports more than 25,000
in-flight work-items and can switch to a new wavefront (containing up
to 64 work-items) in a single cycle.
But why is the wavefront not mentioned in the OpenCL 1.0/2.0/2.2 standards?
None of the specification PDFs contains the word "wavefront": https://www.khronos.org/registry/OpenCL/specs/
I also found:
OpenCL is a open standard. It still does not support this swizzling
concept. It does not even support wavefront/warp yet.
- 2013:
That's why the concept is not on the OpenCL specification itself.
- 2011: https://forums.khronos.org/showthread.php/7211-How-can-i-split-my-work-load-in-a-GPU-with-OpenCL
Standard OpenCL doesn't have the notion of a "wavefront"
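So is it true that the official OpenCL 2.2 standard still does not support wavefronts?
Conclusion:
The OpenCL standard has no wavefronts, but OpenCL 2.0 introduces sub-groups, which follow a SIMD execution model and are similar to wavefronts.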
6.4.2 Workgroup/subgroup-level functions
OpenCL 2.0 introduces a Khronos sub-group extension. Sub-groups are a
logical abstraction of the hardware SIMD execution model akin to
wavefronts, warps, or vectors and permit programming closer to the
hardware in a vendor-independent manner. This extension includes a set
of cross-sub-group built-in functions that match the set of the
cross-work-group built-in functions specified above.
They appear to have adopted a more dynamic approach instead, called sub-groups: https://www.khronos.org/registry/OpenCL/specs/opencl-2.2.pdf
Sub-group: Sub-groups are an implementation-dependent grouping of work-items within a
work-group. The size and number of sub-groups is implementation-defined.
and
Work-groups are further divided into sub-groups,
which provide an additional level of control over execution.
and
The mapping of work-items to
sub-groups is implementation-defined and may be queried at runtime.
So even though it is not called a wavefront, it can now be queried at run time, and
In the absence of synchronization functions (e.g. a barrier),
work-items within a sub-group may be serialized. In the presence of
sub-group functions, work-items within a sub-group may be serialized
before any given sub-group function, between dynamically encountered
pairs of sub-group functions and between a work-group function and
the end of the kernel.
so even lockstep execution is sometimes lost.
Besides that,
sub_group_all() and
sub_group_broadcast() and are described in OpenCL C++ kernel language and IL specifications.
The use of these sub-group functions implies sequenced-before relationships between statements
within the execution of a single work-item in order to satisfy data dependencies.
indicates that some kind of intra-sub-group communication exists. OpenCL now also has a child-kernel definition:
Device-side enqueue: A mechanism whereby a kernel-instance is enqueued by a kernel-instance
running on a device without direct involvement by the host program. This produces nested
parallelism; i.e. additional levels of concurrency are nested inside a running kernel-instance.
The kernel-instance executing on a device (the parent kernel) enqueues a kernel-instance (the
child kernel) to a device-side command queue. Child and parent kernels execute asynchronously
though a parent kernel does not complete until all of its child-kernels have completed.
In the end, with something like
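kernel void launcher()
{
    // One-dimensional range with a single work-item for the child kernel.
    ndrange_t ndrange = ndrange_1D(1);
    // Enqueue the child on the device's default queue; it may start only
    // after the parent kernel has finished executing.
    enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange,
        ^{
            size_t id = get_global_id(0);
        }
    );
}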
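you should be able to spawn your own (upgraded?) wavefronts of whatever size you need. They run asynchronously with respect to the parent kernel (and their work-items can communicate within a sub-group), but they are not called wavefronts because, IMHO, they are not hard-coded by the hardware.
The 2.0 API specification says: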
Extreme care should be exercised when writing code that uses
subgroups if the goal is to write portable OpenCL applications.
This brings to mind AMD's 16-wide SIMDs and NVIDIA's 32-wide SIMDs versus some imaginary FPGA's 95-wide compute cores. Pseudo-wavefronts, maybe?