Gather AVX2&512 intrinsic for 16-bit integers?

Question

Imagine this piece of code:

void Function(int16 *src, int *indices, float *dst, int cnt, float mul)
{
    for (int i=0; i<cnt; i++) dst[i] = float(src[indices[i]]) * mul;
};

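This really calls for a gather intrinsic such as _mm_i32gather_epi32. I have had great success with those when loading floats, but is there one for 16-bit integers? Another problem here is that I need to convert from 16 bits on the input to 32 bits (float) on the output.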

Answer 1

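There is indeed no instruction that gathers 16-bit integers, but (assuming there is no risk of a memory access violation) you can load 32-bit integers starting at the corresponding addresses and discard the upper half of each value. For uint16_t this is a simple bitwise AND. For signed integers you can shift the values left so that the sign bit ends up in the most significant position. You can then shift the values back (arithmetically) before converting them to float or, since you multiply them anyway, simply scale the multiplication factor accordingly. Alternatively, you could load starting two bytes earlier and shift right arithmetically. Either way, your bottleneck will likely be the load ports: vpgatherdd requires 8 load uops, so together with the load of the indices you have 9 loads distributed over two ports, which should result in 4.5 cycles per 8 elements.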
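To make the bit manipulation concrete, here is a minimal scalar sketch of what happens in one gather lane. It is an illustration rather than code from the answer: the helper names (load32_at, extract_u16, extract_s16_shift, extract_s16_early) are hypothetical, and it assumes a little-endian target (as on x86) and that the bytes next to the requested element are readable.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Load 32 bits starting at the address of a 16-bit element, the way one
// gather lane would. memcpy keeps the unaligned access well defined.
// Assumption: little-endian target, neighbouring bytes are readable.
static inline std::uint32_t load32_at(const void *p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Unsigned case: the wanted element is the low half, so mask off the rest.
std::uint32_t extract_u16(const std::uint16_t *src, int idx) {
    return load32_at(src + idx) & 0xFFFFu;
}

// Signed case, variant 1: shift left so the sign bit lands in bit 31
// (this also drops the garbage upper half), then shift back arithmetically.
// The cast and the signed right shift are guaranteed by C++20 and behave
// the same way on mainstream compilers before that.
std::int32_t extract_s16_shift(const std::int16_t *src, int idx) {
    std::uint32_t up = load32_at(src + idx) << 16;
    return static_cast<std::int32_t>(up) >> 16;
}

// Signed case, variant 2: load two bytes earlier, so the wanted element is
// already the upper half, and a single arithmetic right shift is enough.
// Here the extra read is *before* the element, hence idx >= 1.
std::int32_t extract_s16_early(const std::int16_t *src, int idx) {
    assert(idx >= 1);
    std::uint32_t v = load32_at(src + idx - 1);
    return static_cast<std::int32_t>(v) >> 16;
}
```

In the vectorised version the final right shift of variant 1 can be skipped, because a value that is too large by a factor of 0x10000 is corrected for free by dividing the multiplier by 0x10000.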

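A possible AVX2 implementation, untested (it does not handle the last elements; if cnt is not a multiple of 8, run your original loop on the remainder at the end):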

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void Function(int16_t const *src, int const *indices, float *dst, size_t cnt, float mul_)
{
    // fold the 1/0x10000 correction for the left shift below into the multiplier
    __m256 mul = _mm256_set1_ps(mul_*float(1.0f/0x10000));
    for (size_t i=0; i+8<=cnt; i+=8){ // todo handle last elements
        // load indices:
        __m256i idx = _mm256_loadu_si256(reinterpret_cast<__m256i const*>(indices + i));
        // load 16bit integers in the lower halves + garbage in the upper halves
        // (each lane reads 4 bytes, i.e. 2 bytes past the requested element):
        __m256i values = _mm256_i32gather_epi32(reinterpret_cast<int const*>(src), idx, 2);
        // shift each value to upper half (removes garbage, makes sure sign is at the right place)
        // values are too large by a factor of 0x10000
        values = _mm256_slli_epi32(values, 16);
        // convert to float, scale and multiply:
        __m256 fvalues = _mm256_mul_ps(_mm256_cvtepi32_ps(values), mul);
        // store result at the current output position
        _mm256_storeu_ps(dst + i, fvalues);
    }
}
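One way to deal with the leftover elements is to let the vector loop report how far it got and finish with the original scalar loop. The sketch below is an assumption-laden illustration, not part of the answer: the function names are hypothetical, it relies on the GCC/Clang extensions __attribute__((target("avx2"))) and __builtin_cpu_supports so it builds without -mavx2, and it assumes src is readable for 2 bytes past its last element, because every gather lane loads 4 bytes.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference loop; processes elements [begin, cnt) and doubles as the tail.
void gather_scale_scalar(const std::int16_t *src, const int *indices, float *dst,
                         std::size_t begin, std::size_t cnt, float mul) {
    for (std::size_t i = begin; i < cnt; ++i)
        dst[i] = float(src[indices[i]]) * mul;
}

// AVX2 main loop; returns the number of elements it handled (a multiple of 8).
// Assumption: src has at least 2 readable bytes after its last element.
__attribute__((target("avx2")))
std::size_t gather_scale_avx2(const std::int16_t *src, const int *indices, float *dst,
                              std::size_t cnt, float mul_) {
    // The left shift below makes every value 0x10000 times too large,
    // so the correction is folded into the multiplier.
    const __m256 mul = _mm256_set1_ps(mul_ * (1.0f / 0x10000));
    std::size_t i = 0;
    for (; i + 8 <= cnt; i += 8) {
        __m256i idx = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(indices + i));
        // 32-bit gather with a byte scale of 2: wanted value in the low half of each lane
        __m256i values = _mm256_i32gather_epi32(reinterpret_cast<const int *>(src), idx, 2);
        // move the low half to the top: drops the garbage and places the sign bit
        values = _mm256_slli_epi32(values, 16);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(_mm256_cvtepi32_ps(values), mul));
    }
    return i;
}

// Dispatcher: vector loop when the CPU supports AVX2, scalar loop for the rest.
void gather_scale(const std::int16_t *src, const int *indices, float *dst,
                  std::size_t cnt, float mul) {
    assert(cnt == 0 || (src && indices && dst));
    std::size_t done = 0;
    if (__builtin_cpu_supports("avx2"))
        done = gather_scale_avx2(src, indices, dst, cnt, mul);
    gather_scale_scalar(src, indices, dst, done, cnt, mul);
}
```

Because scaling by 0x10000 and by 1/0x10000 only changes the exponent, the vector path should give the same floats as the scalar loop, barring overflow or underflow of the scaled multiplier. Handling the tail in scalar code also means the 4-byte over-read only matters for elements touched by the vector loop.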

Porting this to AVX-512 should be straightforward.
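As a rough idea of what such a port could look like, here is an untested-on-hardware sketch using AVX-512F, processing 16 elements per iteration. The function names are hypothetical, the target attribute and __builtin_cpu_supports are GCC/Clang extensions, and src is again assumed to be readable for 2 bytes past its last element. Note that _mm512_i32gather_epi32 takes the index vector first and the base pointer second, the reverse of the AVX2 intrinsic.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar loop over [begin, cnt); used as fallback and for the tail.
void gather_scale_ref(const std::int16_t *src, const int *indices, float *dst,
                      std::size_t begin, std::size_t cnt, float mul) {
    for (std::size_t i = begin; i < cnt; ++i)
        dst[i] = float(src[indices[i]]) * mul;
}

// AVX-512F main loop; returns the number of elements handled (a multiple of 16).
// Assumption: src has at least 2 readable bytes after its last element.
__attribute__((target("avx512f")))
std::size_t gather_scale_avx512(const std::int16_t *src, const int *indices, float *dst,
                                std::size_t cnt, float mul_) {
    // compensate for the left shift by 16 in the multiplier
    const __m512 mul = _mm512_set1_ps(mul_ * (1.0f / 0x10000));
    std::size_t i = 0;
    for (; i + 16 <= cnt; i += 16) {
        __m512i idx = _mm512_loadu_si512(indices + i);
        // argument order: (index vector, base address, byte scale)
        __m512i values = _mm512_i32gather_epi32(idx, src, 2);
        // keep only the low 16 bits of each lane, moved into the upper half
        values = _mm512_slli_epi32(values, 16);
        _mm512_storeu_ps(dst + i, _mm512_mul_ps(_mm512_cvtepi32_ps(values), mul));
    }
    return i;
}

// Dispatcher: vector loop when AVX-512F is available, scalar loop otherwise.
void gather_scale_512(const std::int16_t *src, const int *indices, float *dst,
                      std::size_t cnt, float mul) {
    assert(cnt == 0 || (src && indices && dst));
    std::size_t done = 0;
    if (__builtin_cpu_supports("avx512f"))
        done = gather_scale_avx512(src, indices, dst, cnt, mul);
    gather_scale_ref(src, indices, dst, done, cnt, mul);
}
```

AVX-512 mask registers would also allow a masked gather and store for the remainder instead of a scalar tail; the scalar tail is used here only to keep the sketch short.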

optimization

avx2

avx512