NEON: Unpacking an int8x16_t into a pair of int16x8_t and packing a pair of int16x8_t into an int8x16_t

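I am implementing a NEON version, for arm64, of an algorithm I wrote.

The problems I am facing are:

- How do I unpack an int8x16_t into two int16x8_t, so that each byte is effectively "cast" (sign-extended) to a short?
- How do I pack those two int16x8_t back into a single int8x16_t?

The reason I am trying to do this is to apply operations to the vectorized shorts without overflowing, and then pack the results back into an int8x16_t at the end.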

Here is my SSE2 implementation of this:

SSE2 unpack:

__m128i a1 = _mm_srai_epi16(_mm_unpacklo_epi8(input, input), 8);
__m128i a2 = _mm_srai_epi16(_mm_unpackhi_epi8(input, input), 8);

SSE2 pack:

__m128i output = _mm_packs_epi16(a1, a2);
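For reference, here is a minimal scalar sketch of what the SSE2 code above does to each lane. The helper names `widen_lane` and `pack_lane` are hypothetical and only for illustration. Interleaving a byte with itself and then arithmetically shifting the 16-bit lane right by 8 leaves the sign-extended byte, and `_mm_packs_epi16` narrows with signed saturation rather than truncation.

```c
#include <stdint.h>

/* Hypothetical helper: the value one 16-bit lane holds after
   _mm_unpacklo_epi8(x, x) followed by _mm_srai_epi16(..., 8).
   The net effect is plain sign extension of the byte. */
static int16_t widen_lane(int8_t b) {
    return (int16_t)b;
}

/* Hypothetical helper: the value one 8-bit lane holds after
   _mm_packs_epi16, which clamps to [-128, 127] instead of
   dropping the high byte. */
static int8_t pack_lane(int16_t s) {
    if (s > INT8_MAX) return INT8_MAX;
    if (s < INT8_MIN) return INT8_MIN;
    return (int8_t)s;
}
```

So a NEON equivalent needs a sign-extending widen and, to match the pack exactly, a saturating narrow.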

You can do it with intrinsics, for example like this:

#include <stdint.h>
#include <arm_neon.h>

void func(int8_t *buf) {
    int8x16_t vec = vld1q_s8(buf); // load 16x int8_t
    int16x8_t short1 = vmovl_s8(vget_low_s8(vec)); // sign-extend the first 8x int8_t to int16_t
    int16x8_t short2 = vmovl_s8(vget_high_s8(vec)); // sign-extend the last 8x int8_t to int16_t
    short1 = vaddq_s16(short1, short1); // Do the operation on int16_t (here: doubling)
    short2 = vaddq_s16(short2, short2);
    // Note: vmovn_s16 truncates (keeps the low 8 bits); use vqmovn_s16 to saturate like _mm_packs_epi16
    vec = vcombine_s8(vmovn_s16(short1), vmovn_s16(short2)); // Narrow back to int8_t and combine the two halves
    vst1q_s8(buf, vec); // Store
}
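Since `_mm_packs_epi16` saturates, a closer match to the SSE2 version would narrow with `vqmovn_s16` instead of `vmovn_s16`. Below is a sketch of that variant, under a few assumptions: the function name `double_saturating` is made up for illustration, the operation is the same doubling as above, and the scalar branch exists only so the snippet also builds on non-NEON targets.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical example: doubles 16 int8_t values in place,
   clamping each result to [-128, 127] instead of wrapping. */
void double_saturating(int8_t *buf) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int8x16_t vec = vld1q_s8(buf);               // load 16x int8_t
    int16x8_t lo = vmovl_s8(vget_low_s8(vec));   // sign-extend lanes 0..7
    int16x8_t hi = vmovl_s8(vget_high_s8(vec));  // sign-extend lanes 8..15
    lo = vaddq_s16(lo, lo);                      // operate in 16 bits, no overflow
    hi = vaddq_s16(hi, hi);
    // Saturating narrow: the counterpart of _mm_packs_epi16
    vec = vcombine_s8(vqmovn_s16(lo), vqmovn_s16(hi));
    vst1q_s8(buf, vec);                          // store 16x int8_t
#else
    // Scalar fallback with the same per-lane behavior (assumption: only for portability)
    for (int i = 0; i < 16; ++i) {
        int16_t s = (int16_t)(2 * buf[i]);
        buf[i] = (int8_t)(s > INT8_MAX ? INT8_MAX : s < INT8_MIN ? INT8_MIN : s);
    }
#endif
}
```

With this variant an input of 100 becomes 127 rather than wrapping to -56. On AArch64 specifically, `vmovl_high_s8(vec)` and `vqmovn_high_s16(vqmovn_s16(lo), hi)` can replace the `vget_high_s8` and `vcombine_s8` steps.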