如何使用 SIMD 计算在 4 个不同的 Vector128 之间找到最大值

Question

我正在尝试使用 SIMD 计算做一些事情。我在我的问题上已经走了很远，然后我陷入困境，想知道如何做到这一点。

我认为最简单的方法是逐步描述我所做的事情：

我使用 Vector128<byte> 然后一次处理 16 个字节

我创建了一个二维数组 (array2D)，其中包含 9 列，每列 16 行。我将数字按以下顺序排列：0 和 2。这意味着例如 Row: 0 只有 0。行：1 只有 2s 等等
现在我 Avx.LoadVector128 每个 column/dimension 给出：9 Vector128<byte> 我输入：dimensionLIST
现在的任务是计算在每一行中可以找到多少个数字：0 and 2。（我们有 16 行）。这些信息最后存储在：counts[0]
查看MessageBox中counts[0]的结果。如下所示： MessageBox.Show(counts[0]);

（代表16行）
[0,9,0,9,0,9,0,9,0,9,0,9,0,9,0,9]

9、隔行发现2个

现在的目标是计算在以下位置找到了多少个“9”：
[0,9,0,9,0,9,0,9,0, 9,0,9,0,9,0,9] 即 8.

所以我们想在这里以某种方式将整数 8 作为标量？

    public unsafe static void SIMDfunction()
    {
        //Create dummy values
        byte[,] array2D = new byte[9, 16]; byte num = 0;
        for (int i = 0; i < 9; i++)
        {
            for (int i2 = 0; i2 < 16; i2++)
            {
                array2D[i, i2] = num;
                if (num == 0) { num = 2; } else { num = 0; }
            }
        }


        /*----------------------------------------------------------------------------------------*/
        unsafe
        {
            //Below starts SIMD calculations!
            fixed (byte* ptr = array2D)
            {
                //Add all 9 dimensions as Vector128
                List<Vector128<byte>> dimensionLIST = new List<Vector128<byte>>();
                for (int i = 0; i < 9; i++)
                {
                    byte* featuredimension = &*((byte*)(ptr + i * 16)); //This gives the first dimension with start: 0
                    dimensionLIST.Add(Avx.LoadVector128(&featuredimension[0])); //add "featuredimension" as a vector of the 16 next numbers: [0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3]
                }


                //Now count how many of: 0,1,2,3 are found in total in all "dimensionLIST" together?
                Span<Vector128<byte>> counts = stackalloc Vector128<byte>[1];
                Span<Vector128<UInt64>> sum64 = stackalloc Vector128<UInt64>[1];
                byte nr2 = 2; byte nr3 = 9; 
                for (int i = 0; i < dimensionLIST.Count; i++) //Each column
                {
                    //Compare: dimensionLIST[i] with Vector128 val to find out how many matches of 2 in this loop
                    //[0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2], [2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2]
                    var match = Avx.CompareEqual(dimensionLIST[i], Vector128.Create(nr2)); //Create Vector128 for numbers: 2
                    counts[0] = Avx.Subtract(counts[0], match);
                }
                //STEP1: Show result on how many 2s are found == 9 occurences of "2"!
                var result = Avx.CompareEqual(Vector128.Create(nr3), counts[0]); //counts[0]: [0,9,0,9,0,9,0,9,0,9,0,9,0,9,0,9] (In total 9 2s are found on those indexes)


                //result:[0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255] Puts - 1 where integer == 9
                MessageBox.Show(result.ToString());

                //Now the goal is to count how many "9" that were found in: [0,9,0,9,0,9,0,9,0,9,0,9,0,9,0,9] which is 8.
                //So somehow we want the the integer 8 as Scalar somehow here?
            }
        }
    }

Answer 1

已编辑问题的答案：计算一个向量的水平匹配

这似乎过于简单化了，就像在实践中您不会知道 9 是您要查找的值一样。但是既然你 hard-coded 在你的来源中，也许这就是你想要的。

您在 pcmpeqb 的正确轨道上找到了与您要查找的元素完全匹配的内容。然后你只需要对这些匹配项进行水平求和：

// This is C, sorry I don't know C#
#include <immintrin.h>

// input: a vector like {0,9,0,9,...} and a value like 9 or 0
int count_matches(__m128i v, int val)
{
    __m128i matches = _mm_cmpeq_epi8(v, _mm_set1_epi8(val));  // 0 or -1
    matches = _mm_sub_epi8(_mm_setzero_si128(), matches);     // 0 or +1
    __m128i hsums = _mm_sad_epu8(matches, _mm_setzero_si128());  // psadbw against 0 = hsum of each qword half = count ones
    __m128i high = _mm_unpackhi_epi64(hsums, hsums);      // punpckhqdq to extract high half
    hsums = _mm_add_epi32(hsums, high);                // paddd or paddw would be fine
    unsigned sum = _mm_cvtsi128_si32(hsums);           // movd to extract the low 32-bit scalar

    return sum;
}

(Godbolt - 有趣的事实：clang10 使用内存中的 set1_epi8(1) 将 sub“优化”为 pand，即使 -march=skylake 其中 vpsubb 具有与 vpand 相同的性能。愚蠢的编译器。）

即只是对 pcmpeqb 结果中“真实”元素的数量进行水平求和。

在 psadbw 之前用 psubb 取反（或用 set1(1) 做 pand）比我想出的任何将 sum*255 变成更有效sum，如果我们刚刚水平添加了 0 或 255 个元素。

-(int8_t)sum 或 (int8_t)-sum 编译为 movsx eax, al / neg eax 这是 2 条指令（假设我们需要 32 位 int 的结果），比 vpsubb 针对已存在的归零向量。如果没有 AVX，它可能会更好，或者如果你在 back-end SIMD 执行端口而不是 front-end.

上遇到瓶颈

sum/255 显然不好，编译器没有足够的信息来优化它，这就是为什么我的答案不使用它。另一个选项是 (sum + 16) >> 8，它恰好给出了从 0*255 到 16*255 的所有 i*255 值的正确答案。但是 Intel CPU 运行在端口 0 或 6 上的变化，而不是任何 ALU 端口，所以这可能比 neg/movsx 更糟糕。 neg 和 movsx 可以在任何 ALU 端口上运行，因此在避免/不应对来自周围代码的 back-end ALU 压力方面最灵活。

vpsubb 运行s 在 Intel Skylake 及更高版本的 p0、p1、p5 中的任何一个上，但在早期的 CPU 上不太灵活。如果没有 AVX，它可能需要一个 movdqa 寄存器副本，或者重做一个 xor-zeroing 来为 psadbw.

创建一个新的零向量

原始/标题问题的答案，找出每个垂直 SIMD 元素中最大值来自 4 个向量中的哪一个

在你有计数[0..3]之后，如果你的计数顶部有空闲位，左移 2 和 OR 与 0..3 标签号（记录它来自哪里）所以你可以使用 pmaxub 到 select 最大计数，同时带上标签。

SIMD MAX运算会作用于整个(counts[i] << 2) | i，所以count部分是整型值的most-significant部分，但是tag部分是自带的整数 MAX 运算的一部分。

“标签”将充当 tie-breaker，偏向更高的 i（即在您的情况下偏向 3）。如果您需要将相等的计数视为向量 0，则与 3 异或或从 3 开始倒数或类似的东西，以便标记位具有您想要的 tie-break 顺序的整数值。

// kind of pseudo-code, I don't know C# so this is more like C with intrinsics
for (int i=0 ; i<4 ; i++){
    counts[i] <<= 2;   // emulate non-existent psllb somehow; 2x paddb or psllw / pand
    counts[i] |= set1(i);  // low 2 bits = tag
}
__m128i m0 = _mm_max_epu8(counts[0], counts[1]);
__m128i m1 = _mm_max_epu8(counts[2], counts[3]);
__m128i max = _mm_max_epu8(m0, m1);
max = _mm_and_si128(max, _mm_set1_epi8(3));  // discard high 6 bits, keep low 2 = tag

或者如果您没有空间左移而不丢失重要位，使用 set1(i) 解压并使用 _mm_max_epu16 (pmaxuw), high/low 两半分别有一个 2x max -> 1x max 的树。所以每个整数都是 (count<<8) | i.

然后你必须 re-pack 到低字节（标记），可能需要你屏蔽掉 _mm_packs_epi16 (packsswb) 之前的值字节。 punpcklbw / punpckhbw 没有真正的倒数；打包指令做签名饱和而不是 t运行cation.

然而，最后的掩码 + 打包步骤只是 2x PAND，在输入上使用 set1_epi16(0x00FF) 来馈送一个 packsswb，并不太复杂。

您可以首先使用 Micro Optimization of a 4-bucket histogram of a large array or list 加快计算计数 - 从 set1(total/16) - counts[0..2] 推断 counts[3]（3 个 SIMD 减法循环结束，每次迭代保存 compare/sub）。

Answer 2

我尝试创建一个代码解决方案。然而代码给出了正确答案：最后是8。

不知何故我想数一数：255 中有多少个位于：
result:[0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255]

我然后SumAbsoluteDifferences和UnpackHigh。

我想知道的是，我不知道这样做是否正确或可以，我除以：255？我这样做只是因为它用数学运算给出了正确答案：

int NUM = Convert.ToInt32(Sse2.Add(sum64[0], upper).ToScalar()) / 255;

    public unsafe static void SIMDfunction2()
    {
        //Create dummy values
        byte[,] array2D = new byte[9, 16]; byte num = 0;
        for (int i = 0; i < 9; i++)
        {
            for (int i2 = 0; i2 < 16; i2++)
            {
                array2D[i, i2] = num;
                if (num == 0) { num = 2; } else { num = 0; }
            }
        }


        /*------------------------------------------------------------*/
        unsafe
        {
            //Below starts SIMD calculations!
            fixed (byte* ptr = array2D)
            {
                //Add all 9 dimensions as Vector128
                List<Vector128<byte>> dimensionLIST = new List<Vector128<byte>>();
                for (int i = 0; i < 9; i++)
                {
                    byte* featuredimension = &*((byte*)(ptr + i * 16)); //This gives the first dimension with start: 0
                    dimensionLIST.Add(Avx.LoadVector128(&featuredimension[0])); //add "featuredimension" as a vector of the 16 next numbers: 
                }


                //Now count how many of: 0,2 are found in total in all "dimensionLIST" together?
                Span<Vector128<byte>> matches = stackalloc Vector128<byte>[1];
                Span<Vector128<UInt64>> hsums = stackalloc Vector128<UInt64>[1];
                byte nr2 = 2; byte nr3 = 9;
                for (int i = 0; i < dimensionLIST.Count; i++) //Each column
                {
                    //STEP 0
                    //Compare: dimensionLIST[i] with Vector128 val to find out how many matches of 2 in this loop
                    //[0,2,0,2,0,2,0,2,0,2,0,2,0,2,0,2], [2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2]
                    var match = Avx.CompareEqual(dimensionLIST[i], Vector128.Create(nr2)); //Create Vector128 for numbers: 2
                    matches[0] = Avx.Subtract(matches[0], match);
                }
                //STEP 1: Show result on how many 2s are found == 9 occurences of "2"!
                var result = Avx.CompareEqual(Vector128.Create(nr3), matches[0]); //matches[0]: [0,9,0,9,0,9,0,9,0,9,0,9,0,9,0,9] (In total 9 2s are found on those indexes)
                //result:[0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255] Puts 255 where integer == 9

                //STEP 2
                result = Avx.Subtract(Vector128<byte>.Zero, result);

                //STEP 3:
                hsums[0] = Sse2.SumAbsoluteDifferences(result, Vector128<byte>.Zero).AsUInt64();

                //STEP 4:
                Vector128<UInt64> upper = Sse2.UnpackHigh(hsums[0], hsums[0]).AsUInt64(); // punpckhqdq to extract high half

                //STEP 5:
                hsums[0] = Sse2.Add(hsums[0], upper); // paddd or paddw would be fine

                //STEP 6:
                UInt32 SUM = Convert.ToUInt32(hsums[0].ToScalar());

                //This returns: 8 which should be the correct SUM
                MessageBox.Show(SUM.ToString());
            }
        }
    }

如何使用 SIMD 计算在 4 个不同的 Vector128 之间找到最大值

How to find largest value between 4 different Vector128 using SIMD calculations

c#

arrays

sse

simd

avx

已编辑问题的答案：计算一个向量的水平匹配

原始/标题问题的答案，找出每个垂直 SIMD 元素中最大值来自 4 个向量中的哪一个