使用 SSE2（作为浮点数）缩放字节像素值 (y=ax+b)？

Question

我想计算y = ax + b，其中x和y是一个像素值[即取值范围为0~255的字节]，而a和b是一个浮动

由于我需要对图像中的每个像素应用此公式，此外，a 和 b 因像素而异。在C++中直接计算很慢，所以我有点想知道C++中的sse2指令..

查找后发现float with sse2的乘法和加法与_mm_mul_ps和_mm_add_ps一样。但首先我需要将字节中的 x 转换为浮点数（4 字节）。

问题是，我从字节数据源（_mm_load_si128）加载数据后，如何将数据从字节数据转换为浮点数？

Answer 1

我猜您正在寻找 __m128 _mm_cvtpi8_ps(__m64 a ) 复合内在函数。

这是一个最小的例子：

#include <xmmintrin.h>
#include <stdio.h>

int main() {
  unsigned char  a[4] __attribute__((aligned(32)))= {1,2,3,4};
  float b[4] __attribute__((aligned(32)));
  _mm_store_ps(b, _mm_cvtpi8_ps(*(__m64*)a));
  printf("%f %f, %f, %f\n", b[0], b[1], b[2], b[3]);
  return 0;
}

Answer 2

a和b每个像素点不一样？这将使向量化变得困难，除非有一个模式，或者你可以在向量中生成它们。

有什么方法可以有效地生成向量中的 a 和 b，无论是定点还是浮点？如果不是，插入 4 个 FP 值，或 8 个 16 位整数，可能比仅标量操作更糟糕。

定点

如果 a 和 b 完全可以重复使用，或者首先用定点生成，这可能是定点数学的一个很好的用例。（即表示值 * 2^scale 的整数）。 SSE/AVX 没有 8b*8b->16b 乘法；最小的元素是单词，所以你必须将字节解包为单词，但不是一直到 32 位。这意味着每条指令可以处理两倍的数据。

有一个 _mm_maddubs_epi16 指令，如果 b 和 a 的变化不够频繁，它可能会有用，或者你可以很容易地生成一个交替的向量 a2^ 4 和 b2^1 个字节。显然它确实是 handy for bilinear interpolation，但如果我们可以准备 a 和 b 向量，它仍然可以用最少的改组为我们完成工作。

float a, b;
const int logascale = 4, logbscale=1;
const int ascale = 1<<logascale;  // fixed point scale for a: 2^4
const int bscale = 1<<logbscale;  // fixed point scale for b: 2^1

const __m128i brescale = _mm_set1_epi8(1<<(logascale-logbscale));  // re-scale b to match a in the 16bit temporary result

for (i=0 ; i<n; i+=16) {
    //__m128i avec = get_scaled_a(i);  
    //__m128i bvec = get_scaled_b(i);
    //__m128i ab_lo = _mm_unpacklo_epi8(avec, bvec);
    //__m128i ab_hi = _mm_unpackhi_epi8(avec, bvec);

    __m128i abvec = _mm_set1_epi16( ((int8_t)(bscale*b) << 8) | (int8_t)(ascale*a) );  // integer promotion rules might do sign-extension in the wrong place here, so check this if you actually write it this way.

    __m128i block = _mm_load_si128(&buf[i]);  // call this { v[0] .. v[15] }

    __m128i lo = _mm_unpacklo_epi8(block, brescale);  // {v[0], 8, v[1], 8, ...}
    __m128i hi = _mm_unpackhi_epi8(block, brescale);  // {v[8], 8, v[9], 8, ...
    lo = _mm_maddubs_epi16(lo, abvec);  // first arg is unsigned bytes, 2nd arg is signed bytes
    hi = _mm_maddubs_epi16(hi, abvec);
    // lo = { v[0]*(2^4*a) + 8*(2^1*b), ... }

    lo = _mm_srli_epi16(lo, logascale);  // truncate from scaled fixed-point to integer
    hi = _mm_srli_epi16(hi, logascale);

    // and re-pack.  Logical, not arithmetic right shift means sign bits can't be set
    block = _mm_packuswb(lo, hi);  
    _mm_store_si128(&buf[i], block);
}
// then a scalar cleanup loop

2^4 是任意选择。它为 a 的整数部分留下 3 个非符号位，以及 4 个小数位。因此它有效地将 a 舍入到最接近的第 16 位，如果它的幅度大于 8 和 15/16ths 则溢出。 2^6 会给出更多的小数位，并允许 a 从 -2 到 +1 和 63/64ths。

由于b是加法而不是乘法，它的有用范围要大得多，小数部分的用处要小得多。用8位来表示，四舍五入到最接近的一半仍然保留了一点小数信息，但允许它是[-64：63.5]而不会溢出。

为了更精确，16b 定点是一个不错的选择。您可以将 a 和 b 放大 2^7 或其他值，以获得 7b 的小数精度，并且仍然允许整数部分为 [-256 .. 255]。这种情况下没有乘加指令，因此您必须单独执行。进行乘法运算的不错选择包括：

_mm_mulhi_epu16：无符号 16b*16b->high16（位 [31:16]）。如果 a 不能为负
_mm_mulhi_epi16：有符号 16b*16b->high16（位 [31:16]）。
_mm_mulhrs_epi16: signed 16b*16b->bits [30:15] of the 32b temporary, with rounding.如果 a 的缩放因子选择得当，这应该会更好。据我了解，SSSE3 引入此指令正是为了这种用途。
_mm_mullo_epi16：有符号 16b*16b->low16（位 [15:0]）。在 low16 结果溢出之前，这只允许 a 的 8 个有效位，所以我认为你通过 _mm_maddubs_epi16 8 位解决方案获得的所有好处是 b.

要使用这些，您将获得 a 和 b 值的缩放 16b 向量，然后：

用零（或pmovzx byte->word）解压你的字节，得到仍在[0..255]范围内的有符号字
将单词左移 7。
乘以您的 a 16b 词向量，取每个 16*16->32 结果的上半部分。（例如 mul
如果您想要 a 和 b 的不同比例，请在此处右移，以获得 a
加上 b。
右移以从固定点返回到 [0..255] 进行最终截断。

如果定点比例选择得当，与 8 位定点相比，这应该能够处理更广泛的 a 和 b，以及更高的小数精度。

如果在将字节解包为单词后不左移字节，a 必须是全范围的才能在结果的高 16 位中设置 8 位。这意味着您可以支持的 a 范围非常有限，而无需在乘法期间将临时值截断为少于 8 位。即使 _mm_mulhrs_epi16 也没有留下太多空间，因为它从第 30 位开始。

将字节扩展为浮点数

如果您不能有效地为每个像素生成定点 a 和 b 值，最好将像素转换为浮点数。这需要更多 unpacking/repacking，因此延迟和吞吐量更差。值得研究用定点生成 a 和 b。

要使 packed-float 起作用，您仍然必须有效地为 4 个相邻像素构建一个 a 值的向量。

这是 pmovzx (SSE4.1) 的一个很好的用例，因为它可以直接从 8b 元素转到 32b。其他选项是具有多个步骤的 SSE2 punpck[l/h]bw/punpck[l/h]wd，或模拟 pmovzx 的 SSSE3 pshufb。（您可以执行一次 16B 加载并以 4 种不同的方式对其进行洗牌，以将其解压缩为 4 个 32b 整数向量。）

char *buf;

// const __m128i zero = _mm_setzero_si128();
for (i=0 ; i<n; i+=16) {
    __m128 a = get_a(i);
    __m128 b = get_b(i);

    // IDK why there isn't an intrinsic for using `pmovzx` as a load, because it takes a m32 or m64 operand, not m128.  (unlike punpck*)
    __m128i unsigned_dwords = _mm_cvtepu8_epi32( _mm_loadu_si32(buf+i));  // load 4B at once. 
   //  Current GCC has a bug with _mm_loadu_si32, might want to use _mm_load_ss and _mm_castps_si128 until it's fixed.
    __m128 floats = _mm_cvtepi32_ps(unsigned_dwords);
  
    floats = _mm_fmadd_ps(floats, a, b);  // with FMA available, this might as well be 256b vectors, even with the inconvenience of the different lane-crossing semantics of pmovzx vs. punpck
    // or without FMA, do this with _mm_mul_ps and _mm_add_ps
    unsigned_dwords = _mm_cvtps_epi32(floats);

    // repeat 3 more times for buf+4, buf+8, and buf+12, then:

    __m128i packed01 = _mm_packss_epi32(dwords0, dwords1);  // SSE2
    __m128i packed23 = _mm_packss_epi32(dwords2, dwords3);
    // packuswb wants SIGNED input, so do signed saturation on the first step

    // saturate into [0..255] range
    __m12i8 packedbytes=_mm_packus_epi16(packed01, packed23);  // SSE2
    _mm_store_si128(buf+i, packedbytes);  // or storeu if buf isn't aligned.
}

// cleanup code to handle the odd up-to-15 leftover bytes, if n%16 != 0

（回复：可以作为 pmovzxbd 的内存源操作数的加载，另请参阅 re: the problems compilers have with this.) And see also GCC bug 99754 - _mm_loadu_si32 的错误代码 - 反向向量元素。

此答案的前一版本来自 packusdw/packuswb 的 float->uint8 向量，并且有一整节都是关于没有 SSE4.1 的解决方法。 None 如果您只是，则需要在未签名包之后屏蔽符号位。我认为这就是 SSE2 只包含从 dword 到 word 的签名包，但同时包括从 word 到字节的签名包和未签名包的原因。 packuswd 仅在您的最终目标是 uint16_t 时才有用，而不是进一步打包。

最后一个 CPU 到没有有 SSE4.1 的是英特尔 Conroe/merom（第一代 Core2，从 2007 年底之前）和 AMD pre Barcelona （2007 年底之前）。如果 working-but-slow 对于那些 CPUs 是可以接受的，只需编写一个 AVX2 版本和一个 SSE4.1 版本。或者 SSSE3（使用 4x pshufb 来模拟寄存器的四个 32b 元素的 pmovzxbd）pshufb 在 Conroe 上很慢，但是，所以如果你关心没有 SSE4.1 的 CPUs，写一个特定的版本。其实Conroe/merom还有slow xmmpunpcklbw等等（q->dq除外）。 4 倍慢 pshufb 应该仍然胜过 6 倍慢解包。由于解包和重新打包的缓慢洗牌，矢量化在 pre-Wolfdale 上的胜利要少得多。固定点版本，unpacking/repacking，将在那里有更大的优势。

在我意识到它需要多少额外指令之前，查看编辑历史以了解使用 punpck 的未完成尝试。删除它是因为这个答案已经很长了，另一个代码块会令人困惑。

使用 SSE2（作为浮点数）缩放字节像素值 (y=ax+b)？

Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?

c++

x86

simd

sse2

visual-studio

定点

将字节扩展为浮点数