Puzzling GCC behaviour with respect to vectorization and loop size

Question

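While initially investigating the effect of the #pragma omp simd directive, I came across a behaviour that I cannot explain, related to the vectorization of a simple for loop. The following code sample can be tested on this awesome compiler explorer, provided the -O3 flag is applied and the target is x86.

Could anybody explain the logic behind the following observations?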

#include <stdint.h> 

void test(uint8_t* out, uint8_t const* in, uint32_t length)
{
    unsigned const l1 = (length * 32)/32;  // This is vectorized
    unsigned const l2 = (length / 32)*32;  // This is not vectorized

    unsigned const l3 = (length << 5)>>5;  // This is vectorized
    unsigned const l4 = (length >> 5)<<5;  // This is not vectorized

    unsigned const l5 = length -length%32; // This is not vectorized
    unsigned const l6 = length & ~(32 -1); // This is not vectorized

    for (unsigned i = 0; i<l1 /*pick your choice*/; ++i)
    {
      out[i] = in[i*2];
    }
}

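What puzzles me is that both l1 and l3 generate vectorized code even though they are not guaranteed to be multiples of 32. All the other lengths do not produce vectorized code, although they should be multiples of 32. Is there a reason behind this?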
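For reference, the claim about the bounds can be checked with a small sketch like the one below. The helper names (bound_l1 through bound_l6, rounds_down_to_32, clears_top_bits) are made up for illustration, and the code assumes a 32-bit unsigned, as on x86 with GCC. It suggests that l2, l4, l5 and l6 all round length down to a multiple of 32, while l1 and l3 merely clear the top five bits.

```c
#include <stdint.h>

/* Hypothetical helpers: each one mirrors a candidate bound from test(). */
static uint32_t bound_l1(uint32_t length) { return (length * 32) / 32; }
static uint32_t bound_l2(uint32_t length) { return (length / 32) * 32; }
static uint32_t bound_l3(uint32_t length) { return (length << 5) >> 5; }
static uint32_t bound_l4(uint32_t length) { return (length >> 5) << 5; }
static uint32_t bound_l5(uint32_t length) { return length - length % 32; }
static uint32_t bound_l6(uint32_t length) { return length & ~(32u - 1); }

/* Returns 1 when l2, l4, l5 and l6 agree and the result is a multiple of 32. */
static int rounds_down_to_32(uint32_t length)
{
    uint32_t r = bound_l2(length);
    return r == bound_l4(length) && r == bound_l5(length)
        && r == bound_l6(length) && r % 32 == 0;
}

/* Returns 1 when l1 and l3 agree and equal length with its top 5 bits cleared. */
static int clears_top_bits(uint32_t length)
{
    uint32_t r = bound_l1(length);
    return r == bound_l3(length) && r == (length & 0x07FFFFFFu);
}
```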

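As an aside, using the #pragma omp simd directive does not actually change anything.

Edit: after further investigation, the difference in behaviour disappears when the index type is size_t (and no boundary manipulation is even needed), meaning that this generates vectorized code: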

#include <stdint.h> 
#include <stddef.h>

void test(uint8_t* out, uint8_t const* in, size_t length)
{
    for (size_t i = 0; i<length; ++i)
    {
        out[i] = in[i*2];
    }
}

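If someone knows why loop vectorization depends so heavily on the index type, I would be curious to know more!

Edit2: thanks to Mark Lakata, -O3 is in fact required.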

Answer 1

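The issue is the apparent conversion¹ from unsigned to size_t in the array index: in[i*2];

If you use l1 or l3, the bound is small enough that i*2 can never wrap, so computing it in unsigned gives the same value as computing it in size_t. In effect, the type unsigned behaves as if it were size_t.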

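With the other options, however, i*2 can wrap around in unsigned arithmetic before it is converted to size_t, so the compiler has to preserve that wraparound and actually perform the conversion.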
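To make the wraparound argument concrete, here is a minimal sketch, assuming a 32-bit unsigned (as on x86 with GCC). The helper name index_can_wrap is hypothetical. It checks whether the largest i*2 the loop computes for a given bound still fits in 32 bits.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if, for a loop `for (i = 0; i < bound; ++i)`,
   the expression i*2 can exceed the 32-bit range for some i, 0 otherwise. */
static int index_can_wrap(uint32_t bound)
{
    if (bound == 0)
        return 0;                                  /* loop body never runs */
    /* Largest i is bound - 1; do the multiplication in 64 bits so it cannot wrap. */
    uint64_t max_index = (uint64_t)(bound - 1) * 2;
    return max_index > UINT32_MAX;
}
```

With the largest value l1 or l3 can take (0x07FFFFFF) this returns 0; with the largest value l2 can take (0xFFFFFFE0) it returns 1.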

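If you take the first example, pick a bound other than l1 or l3, and cast only i:

out[i] = in[( size_t )i*2];

the compiler vectorizes the loop. If you cast the whole expression instead:

out[i] = in[( size_t )(i*2)];

it does not.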
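The two casts are not equivalent, which is presumably why the compiler treats them differently. A small sketch, assuming a 32-bit unsigned and a 64-bit size_t (the function names are made up for illustration):

```c
#include <stddef.h>

/* Widen first, then multiply: the multiplication happens in size_t. */
static size_t index_widen_first(unsigned i)
{
    return (size_t)i * 2;
}

/* Multiply first, then widen: the multiplication happens in unsigned
   and may wrap before the result is converted to size_t. */
static size_t index_wrap_first(unsigned i)
{
    return (size_t)(i * 2);
}
```

Under those assumptions, for i = 0x80000000 the first form yields 0x100000000, while the second wraps to 0 before being widened.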

¹ The Standard does not actually specify that the type of the index must be size_t, but it is a logical step from the compiler's perspective.

Answer 2

I believe you are confusing optimization with vectorization. I used your compiler explorer with -O2 set for x86, and none of the examples are "vectorized".

Here is l1

test(unsigned char*, unsigned char const*, unsigned int):
        xorl    %eax, %eax
        andl    $134217727, %edx
        je      .L1
.L5:
        movzbl  (%rsi,%rax,2), %ecx
        movb    %cl, (%rdi,%rax)
        addq    $1, %rax
        cmpl    %eax, %edx
        ja      .L5
.L1:
        rep ret

Here is l2

test(unsigned char*, unsigned char const*, unsigned int):
        andl    $-32, %edx
        je      .L1
        leal    -1(%rdx), %eax
        leaq    1(%rdi,%rax), %rcx
        xorl    %eax, %eax
.L4:
        movl    %eax, %edx
        addq    $1, %rdi
        addl    $2, %eax
        movzbl  (%rsi,%rdx), %edx
        movb    %dl, -1(%rdi)
        cmpq    %rcx, %rdi
        jne     .L4
.L1:
        rep ret

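This is not surprising, because what you are doing is essentially a "gather" load operation, where the load indices differ from the store indices. Baseline x86 has no support for gather/scatter. Gather was only introduced with AVX2 and scatter with AVX-512, and neither was selected here.

The slightly longer code is dealing with the signed/unsigned issue, but there is no vectorization going on.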
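As a rough reading aid, the two listings can be paraphrased in C as follows. This is only a sketch of what the assembly appears to do, not compiler output, and the function names copy_like_l1 and copy_like_l2 are invented here. The l1 loop indexes with a full 64-bit register, whereas the l2 loop keeps the load index in a 32-bit register so that it wraps the way 32-bit unsigned arithmetic requires.

```c
#include <stdint.h>

/* Paraphrase of the l1 listing: the bound is masked to 27 bits,
   so a single 64-bit counter can be scaled by 2 without wrapping. */
static void copy_like_l1(uint8_t *out, const uint8_t *in, uint32_t length)
{
    uint32_t n = length & 0x07FFFFFFu;      /* (length * 32) / 32 */
    for (uint64_t i = 0; i < n; ++i)
        out[i] = in[i * 2];
}

/* Paraphrase of the l2 listing: the store walks a pointer up to an end
   pointer, while the load index is a separate 32-bit value stepped by 2. */
static void copy_like_l2(uint8_t *out, const uint8_t *in, uint32_t length)
{
    uint32_t n = length & ~31u;             /* (length / 32) * 32 */
    uint8_t *end = out + n;
    uint32_t idx = 0;                       /* wraps modulo 2^32, like i*2 */
    while (out != end) {
        *out++ = in[idx];
        idx += 2;
    }
}
```

Both are scalar, one byte per iteration, which matches the observation that neither listing is vectorized.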
