Is it possible to write to a non-4-byte-aligned address with an HLSL compute shader?

Question

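I am trying to convert an existing OpenCL kernel to an HLSL compute shader.

The OpenCL kernel samples each pixel in an RGBA texture and writes each color channel to a tightly packed array.

So basically, I need to write to a tightly packed uchar array in a pattern that goes somewhat like this: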

r r r ... r g g g ... g b b b ... b a a a ... a

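where each letter stands for a single byte (red / green / blue / alpha) that originates from a pixel channel.

Going through the documentation for the RWByteAddressBuffer Store method, it clearly states: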

void Store(
  in uint address,
  in uint value
);

address [in]

Type: uint

The input address in bytes, which must be a multiple of 4.

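In order to write the correct pattern to the buffer, I must be able to write a single byte to a non-aligned address. In OpenCL / CUDA this is pretty trivial.

Is it technically possible to achieve that with HLSL?
Is this a known limitation? Are there possible workarounds?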

Answer 1

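As far as I know, it is not possible to write directly to a non-aligned address in this scenario. You can, however, use a little trick to achieve what you want. Below is the code of the entire compute shader, which does exactly that. The function StoreValueAtByte in particular is what you are looking for.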

Texture2D<float4> Input;
RWByteAddressBuffer Output;

void StoreValueAtByte(in uint index_of_byte, in uint value) {

    // Calculate the address of the 4-byte-slot in which index_of_byte resides.
    // Integer masking avoids float division, which loses precision above 2^24.
    uint addr_align4 = index_of_byte & ~3u;

    // Calculate which byte within the 4-byte-slot it is
    uint location = index_of_byte % 4;

    // Shift the byte to its location within the 4-byte-slot. Memory is
    // little-endian, so byte 0 of the slot is the lowest 8 bits of the uint.
    value = (value & 0xFFu) << (location * 8);

    // Atomically OR the byte into the slot (the slot must start out as zero)
    Output.InterlockedOr(addr_align4, value);
}

[numthreads(20, 20, 1)]
void CSMAIN(uint3 ID : SV_DispatchThreadID) {

    // Get width and height of texture
    uint tex_width, tex_height;
    Input.GetDimensions(tex_width, tex_height);

    // Make sure thread does not operate outside the texture
    if(tex_width > ID.x && tex_height > ID.y) {

        uint num_pixels = tex_width * tex_height;

        // Calculate address of where to write color channel data of pixel
        uint addr_red = 0 * num_pixels + ID.y * tex_width + ID.x;
        uint addr_green = 1 * num_pixels + ID.y * tex_width + ID.x;
        uint addr_blue = 2 * num_pixels + ID.y * tex_width + ID.x;
        uint addr_alpha = 3 * num_pixels + ID.y * tex_width + ID.x;

        // Get color of pixel and convert from [0,1] to [0,255]; saturate() guards
        // against values outside [0,1], which would not fit into a single byte
        float4 color = Input[ID.xy];
        uint4 color_final = (uint4)round(saturate(color) * 255.0f);

        // Store color channel values in output buffer
        StoreValueAtByte(addr_red, color_final.x);
        StoreValueAtByte(addr_green, color_final.y);
        StoreValueAtByte(addr_blue, color_final.z);
        StoreValueAtByte(addr_alpha, color_final.w);
    }
}

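I hope the code is self-explanatory, since it is hard to explain, but I'll try anyway.
The first thing StoreValueAtByte does is calculate the address of the 4-byte slot that contains the byte you want to write. It then calculates the position of the byte within that slot (the first, second, third, or fourth byte). Since the byte to write already sits in a 4-byte variable (namely value) and occupies its lowest-order (rightmost) byte, you only have to shift it to the proper position within that variable. Finally, value is written to the buffer at the 4-byte-aligned address. This is done with an atomic bitwise OR (InterlockedOr) because several threads write to the same 4-byte slot; plain stores would overwrite each other's bytes, which is a write-after-write hazard. This of course only works if the entire output buffer is initialized with zeros before the dispatch call is issued.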
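As a possible variation on the approach above, the atomics and the zero-initialization can be avoided if each thread owns an entire 4-byte slot: it reads four consecutive pixels, packs one byte from each into a uint per channel, and issues plain aligned Store calls. The following is only a sketch under stated assumptions, not part of the original answer: the entry point name CSPackFour and the group size of 64 are made up for illustration, and the pixel count is assumed to be a multiple of 4 so that every channel plane starts on a 4-byte boundary.

```hlsl
Texture2D<float4> Input;
RWByteAddressBuffer Output;

// Hypothetical entry point: one thread per 4-byte slot of each channel plane.
// Assumes tex_width * tex_height is a multiple of 4.
[numthreads(64, 1, 1)]
void CSPackFour(uint3 ID : SV_DispatchThreadID) {

    uint tex_width, tex_height;
    Input.GetDimensions(tex_width, tex_height);
    uint num_pixels = tex_width * tex_height;

    // Linear index of the first of the four pixels handled by this thread;
    // it is also the byte offset of this thread's slot within a channel plane
    uint first_pixel = ID.x * 4;
    if (first_pixel >= num_pixels) {
        return;
    }

    // One uint per channel (x = red, y = green, z = blue, w = alpha)
    uint4 slots = uint4(0, 0, 0, 0);

    [unroll]
    for (uint i = 0; i < 4; ++i) {
        uint p = first_pixel + i;
        float4 color = Input[uint2(p % tex_width, p / tex_width)];
        uint4 bytes = (uint4)round(saturate(color) * 255.0f);

        // Pixel i goes into byte i of the slot (little-endian byte order)
        slots |= bytes << (i * 8);
    }

    // All four addresses are multiples of 4, so plain stores are allowed
    Output.Store(0 * num_pixels + first_pixel, slots.x);
    Output.Store(1 * num_pixels + first_pixel, slots.y);
    Output.Store(2 * num_pixels + first_pixel, slots.z);
    Output.Store(3 * num_pixels + first_pixel, slots.w);
}
```

With this layout the dispatch would need roughly num_pixels / 4 / 64 thread groups in x, rounded up. Because no two threads touch the same slot, the buffer does not have to be cleared beforehand. The trade-off is that each thread reads four pixels, and textures whose pixel count is not a multiple of 4 would need extra handling.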

Tags: direct3d, hlsl, compute-shader, directcompute, direct3d11