How to convert uint32 to uint8 using SIMD but not AVX-512?

Suppose a lot of uint32s are stored in aligned memory at uint32 *p. How can I convert them to uint8s using SIMD?

I see there is _mm256_cvtepi32_epi8 / vpmovdb, but it belongs to AVX-512, which my CPU doesn't support.

If you really have a lot of them, I would do something like this (untested).

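Each iteration of the main loop reads 64 bytes containing 16 uint32_t values, shuffles the bytes to implement the truncation, merges the result into a single register, and writes 16 bytes with one vector store instruction.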

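// Requires AVX2; build with AVX2 enabled (e.g. -mavx2 for GCC/Clang)
#include <immintrin.h> // AVX2 intrinsics
#include <cstddef>     // size_t
#include <cstdint>     // uint32_t, uint8_t

void convertToBytes( const uint32_t* source, uint8_t* dest, size_t count )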
{
    // 4 bytes of the shuffle mask, fetching bytes 0, 4, 8 and 12 from a 16-byte lane of the source vector
    constexpr int shuffleScalar = 0x0C080400;
    // Mask to shuffle first 8 values of the batch, making first 8 bytes of the result
    const __m256i shuffMaskLow = _mm256_setr_epi32( shuffleScalar, -1, -1, -1, -1, shuffleScalar, -1, -1 );
    // Mask to shuffle last 8 values of the batch, making last 8 bytes of the result
    const __m256i shuffMaskHigh = _mm256_setr_epi32( -1, -1, shuffleScalar, -1, -1, -1, -1, shuffleScalar );
    // Indices for the final _mm256_permutevar8x32_epi32
    const __m256i finalPermute = _mm256_setr_epi32( 0, 5, 2, 7, 0, 5, 2, 7 );

    const uint32_t* const sourceEnd = source + count;
    // Vectorized portion, each iteration handles 16 values.
    // Round down the count making it a multiple of 16.
    const size_t countRounded = count & ~( (size_t)15 );
    const uint32_t* const sourceEndAligned = source + countRounded;
    while( source < sourceEndAligned )
    {
        // Load 16 inputs into 2 vector registers.
        // Aligned loads: source must be 32-byte aligned, otherwise use _mm256_loadu_si256
        const __m256i s1 = _mm256_load_si256( ( const __m256i* )source );
        const __m256i s2 = _mm256_load_si256( ( const __m256i* )( source + 8 ) );
        source += 16;
        // Shuffle bytes into correct positions within each 128-bit lane; mask bytes of -1 zero out the rest.
        const __m256i low = _mm256_shuffle_epi8( s1, shuffMaskLow );
        const __m256i high = _mm256_shuffle_epi8( s2, shuffMaskHigh );
        // The unused bytes were zeroed out, so a cheap bitwise OR merges the two vectors.
        const __m256i res32 = _mm256_or_si256( low, high );
        // Final cross-lane shuffle of the 32-bit values, gathering the 4 packed dwords into the lower 16 bytes
        const __m256i res16 = _mm256_permutevar8x32_epi32( res32, finalPermute );
        // Store lower 16 bytes of the result
        _mm_storeu_si128( ( __m128i* )dest, _mm256_castsi256_si128( res16 ) );
        dest += 16;
    }

    // Deal with the remainder
    while( source < sourceEnd )
    {
        *dest = (uint8_t)( *source );
        source++;
        dest++;
    }
}
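
For a CPU without AVX2, the same truncation can be sketched with SSE2 only, which every x86-64 processor supports. This is not part of the answer above: the name convertToBytesSse2 is hypothetical, and the approach is a different one (mask each value to its low byte, then narrow with the saturating pack instructions) rather than a byte shuffle. It uses unaligned loads, so it makes no assumption about the alignment of the source pointer.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstddef>     // size_t
#include <cstdint>     // uint32_t, uint8_t

// Hypothetical SSE2-only variant: truncates each uint32_t to its lowest byte.
void convertToBytesSse2( const uint32_t* source, uint8_t* dest, size_t count )
{
    // Keeps only the lowest byte of every 32-bit element
    const __m128i lowByteMask = _mm_set1_epi32( 0xFF );

    const uint32_t* const sourceEnd = source + count;
    // Vectorized portion, each iteration handles 16 values.
    const uint32_t* const sourceEndRounded = source + ( count & ~( (size_t)15 ) );
    while( source < sourceEndRounded )
    {
        // Load 16 inputs into 4 vector registers (unaligned loads)
        __m128i a = _mm_loadu_si128( ( const __m128i* )source );
        __m128i b = _mm_loadu_si128( ( const __m128i* )( source + 4 ) );
        __m128i c = _mm_loadu_si128( ( const __m128i* )( source + 8 ) );
        __m128i d = _mm_loadu_si128( ( const __m128i* )( source + 12 ) );
        source += 16;

        // Mask every value into the range 0..255, so the saturating packs below never clamp
        a = _mm_and_si128( a, lowByteMask );
        b = _mm_and_si128( b, lowByteMask );
        c = _mm_and_si128( c, lowByteMask );
        d = _mm_and_si128( d, lowByteMask );

        // Narrow 32 -> 16 bits (signed saturation), preserving element order
        const __m128i ab = _mm_packs_epi32( a, b );
        const __m128i cd = _mm_packs_epi32( c, d );
        // Narrow 16 -> 8 bits (unsigned saturation)
        const __m128i res = _mm_packus_epi16( ab, cd );

        // Store all 16 result bytes
        _mm_storeu_si128( ( __m128i* )dest, res );
        dest += 16;
    }

    // Deal with the remainder
    while( source < sourceEnd )
    {
        *dest = (uint8_t)( *source );
        source++;
        dest++;
    }
}
```

The signature matches convertToBytes, so either routine can be checked the same way: run it on a buffer and compare each output byte against (uint8_t)source[i] from a plain scalar loop. Which of the two is faster on a given CPU would need to be measured.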