SSE _mm_dp_ps size result

Question

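I'm starting to do operations with SSE. I want to compute two dot products with _mm_dp_ps, saving the first result in aux_sse and the second in aux2_sse. B is an 8-element vector whose values are all 1.

Since I only need two floats for each pair, I wrote the following code: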

    printf("A  \n");
    for(i = 0; i < M; i++){
        for(j = 0; j < ele; j++){
            A[i*ele+j] = i*ele+j;
            printf(" %f ", A[i*ele+j]);
        }
        printf("\n");
    }
    
    float aux[ele*M];
    float aux2[ele*M];
    __m128 *A_sse = (__m128*) A;
    __m128 *B_sse = (__m128*) B;
    __m128 *aux_sse  = (__m128*) aux;
    __m128 *aux2_sse  = (__m128*) aux2;
    for(int i = 0; i < M; i++)
    {
        *aux_sse =  _mm_dp_ps (*A_sse, *B_sse,  0xFF);
        printf("%f \n", aux[i]);

        B_sse ++;
        A_sse++;
        *aux2_sse = _mm_dp_ps (*A_sse, *B_sse,  0xFF);
        printf("%f \n", aux2[i]);

        B_sse --;
        A_sse ++;

        aux_sse+= sizeof(char);
        aux2_sse+= sizeof(char);
    }

I get the following incorrect output:

    A
     0.000000  1.000000  2.000000  3.000000  4.000000  5.000000  6.000000  7.000000
     8.000000  9.000000  10.000000  11.000000  12.000000  13.000000  14.000000  15.000000
    6.000000
    22.000000
    6.000000
    22.000000

According to this description of the intrinsic:

Conditionally multiply the packed single-precision (32-bit) floating-point elements in a and b using the high 4 bits in imm8, sum the four products, and conditionally store the sum in dst using the low 4 bits of imm8.

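I understand that imm8 specifies which elements of the destination the result is stored in.

As I understand it, even though the result ends up in all 4 elements of the output vector, if I advance the pointer by only one element with aux_sse += sizeof(char), each store should overwrite most of the previous one and I should get the output I want. However, that is not what happens.

If I change how I print the results from aux and aux2 as follows, the output is correct.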

    printf("%f \n", aux[i*4]);
    printf("%f \n", aux2[i*4]);

Output:

I'm using the gcc compiler. Does anyone know where the problem is? Any answer would be helpful.

Edit:

I need the elements of aux and aux2 to correspond to each iteration:

    aux[i] = dot product computed in iteration i

Answer 1

aux_sse += sizeof(char); is a ridiculous way to write aux_sse += 1. Either way it advances the pointer by 16 bytes, i.e. 4 floats, because sizeof(__m128) == sizeof(*aux_sse) == 16.
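To make the pointer arithmetic concrete, here is a minimal sketch. The helper name stride_in_floats is made up for illustration; it steps a __m128* by sizeof(char) and measures the distance in float elements.

```c
#include <assert.h>      /* static_assert (C11) */
#include <stdalign.h>    /* alignas (C11) */
#include <stddef.h>
#include <xmmintrin.h>   /* __m128 */

static_assert(sizeof(__m128) == 16, "one __m128 holds 4 floats");

/* Hypothetical helper: how many floats does `p += sizeof(char)` skip
   when p is a __m128*? */
static size_t stride_in_floats(void)
{
    alignas(16) float buf[8] = {0};
    __m128 *p = (__m128 *)buf;      /* used only for address arithmetic here */
    __m128 *q = p + sizeof(char);   /* sizeof(char) is 1 by definition: same as p + 1 */
    return (size_t)((float *)q - buf);
}
```

stride_in_floats() returns 4, not 1, which is why the values only line up when the float index is scaled by 4.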

So if you are also accessing the array through a float index, then yes: if you only increment i by 1 per vector of 4 floats, you have to scale it by 4.

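It's usually easier to use _mm_store_ps(&aux[i], v); instead of keeping track of separate __m128* variables that access the same array. Also use i += 4, so that i actually indexes the start of the group of 4 elements you're working on instead of needing to be scaled. That makes it easier to write loop bounds like i < M-3.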
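A sketch of that loop shape, assuming all three arrays are 16-byte aligned and that the caller passes the element count n. The function name dp_groups_of_4 and its parameters are hypothetical, and the target attribute is GCC/Clang-specific (it is only there so the code builds without -msse4.1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>   /* SSE4.1: _mm_dp_ps */

/* GCC/Clang attribute: enables SSE4.1 for this one function. */
__attribute__((target("sse4.1")))
static void dp_groups_of_4(const float *a, const float *b, float *out, size_t n)
{
    /* _mm_load_ps / _mm_store_ps require 16-byte aligned addresses. */
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    /* i counts floats, not vectors. Written as i + 3 < n because size_t
       is unsigned, so n - 3 would wrap around for n < 3. */
    for (size_t i = 0; i + 3 < n; i += 4) {
        __m128 v = _mm_dp_ps(_mm_load_ps(&a[i]), _mm_load_ps(&b[i]), 0xFF);
        _mm_store_ps(&out[i], v);   /* out[i..i+3] all hold the same sum */
    }
}
```

With the mask 0xFF every group of 4 outputs holds 4 copies of one dot product, so the same index i works for both the vector store and any later scalar read.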

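Also note that you should use alignas(16) float aux[ele*M]; if you are going to do alignment-required accesses to the array. GCC will notice what you're doing and align the array for you when it sees how the array is used, but don't count on that in general.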
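A small sketch of an explicitly aligned local array, using a fixed size for simplicity. The helper name aligned_store_demo is hypothetical, and the assert is only there to show what the alignment-required store relies on.

```c
#include <assert.h>
#include <stdalign.h>    /* alignas (C11) */
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical helper: store 4 copies of x into the second vector of an
   aligned local array and read one element back. */
static float aligned_store_demo(float x)
{
    alignas(16) float aux[8] = {0};         /* alignment requested, not left to luck */
    assert((uintptr_t)aux % 16 == 0);
    _mm_store_ps(&aux[4], _mm_set1_ps(x));  /* &aux[4] is 16 bytes past an aligned base */
    return aux[7];
}
```

If the alignment cannot be guaranteed, _mm_loadu_ps and _mm_storeu_ps are the variants without the alignment requirement.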

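Or did you want to store only one float result, rather than 4 identical dot products for each group of 4 inputs? In that case you should extract the low scalar element, e.g. with _mm_store_ss(&aux[i], v), or use _mm_cvtss_f32(v) to get the low element of the vector as a scalar float.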
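A sketch of that scalar version, assuming 8 floats per row as in the question and a single 8-float B. The function name row_dots is hypothetical, and the mask 0xF1 is my choice here: the high nibble multiplies all four pairs, the low nibble writes the sum only to element 0. Unaligned loads are used so the sketch makes no alignment assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <smmintrin.h>   /* SSE4.1: _mm_dp_ps */

/* GCC/Clang attribute so this builds without -msse4.1. */
__attribute__((target("sse4.1")))
static void row_dots(const float *a, const float *b,
                     float *aux, float *aux2, size_t rows)
{
    assert(a && b && aux && aux2);
    for (size_t i = 0; i < rows; i++) {
        const float *row = a + 8 * i;   /* assumed layout: 8 floats per row */
        __m128 first  = _mm_dp_ps(_mm_loadu_ps(row),     _mm_loadu_ps(b),     0xF1);
        __m128 second = _mm_dp_ps(_mm_loadu_ps(row + 4), _mm_loadu_ps(b + 4), 0xF1);
        _mm_store_ss(&aux[i], first);       /* writes exactly one float */
        aux2[i] = _mm_cvtss_f32(second);    /* same idea, as a scalar value */
    }
}
```

Here aux[i] and aux2[i] are indexed by the iteration number directly, with one float written per iteration.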

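If you want, you could do four 4-element dot products manually, producing 1 vector of 4 results: _mm_mul_ps, then maybe 2x _mm_hadd_ps (SSE3) to reduce horizontally, which is a kind of transpose-and-add. (Suggested by @mainactual.)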
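One way this could look, as a sketch: a and b each point to 16 floats, treated as four groups of 4. The function name dot4x4_hadd is hypothetical. The "2x" is two levels of hadd, which comes to three _mm_hadd_ps instructions for four results.

```c
#include <assert.h>
#include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */

/* GCC/Clang attribute so this builds without -msse3. */
__attribute__((target("sse3")))
static void dot4x4_hadd(const float *a, const float *b, float out[4])
{
    assert(a && b && out);
    /* Element-wise products of the four groups. */
    __m128 p0 = _mm_mul_ps(_mm_loadu_ps(a +  0), _mm_loadu_ps(b +  0));
    __m128 p1 = _mm_mul_ps(_mm_loadu_ps(a +  4), _mm_loadu_ps(b +  4));
    __m128 p2 = _mm_mul_ps(_mm_loadu_ps(a +  8), _mm_loadu_ps(b +  8));
    __m128 p3 = _mm_mul_ps(_mm_loadu_ps(a + 12), _mm_loadu_ps(b + 12));

    /* Level 1: pairwise sums within each product vector. */
    __m128 h01 = _mm_hadd_ps(p0, p1);   /* p0[0]+p0[1], p0[2]+p0[3], p1[0]+p1[1], p1[2]+p1[3] */
    __m128 h23 = _mm_hadd_ps(p2, p3);

    /* Level 2: finish each sum; element k is the dot product of group k. */
    _mm_storeu_ps(out, _mm_hadd_ps(h01, h23));
}
```

All four results come out in one vector, in group order, so they can be stored with a single vector store.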

dpps is 4 uops on Skylake and similar Intel CPUs (https://uops.info), so it's not great if you have multiple dot products to do.

To avoid SSE3, you could use _mm_shuffle_ps (shufps) to pick elements from 2 vectors, or some _mm_unpacklo_ps / unpackhi may be useful, or the pd versions, which keep pairs of elements together.
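A sketch of one possible SSE1-only arrangement, using unpacklo/unpackhi for the first step and _mm_shuffle_ps for the second. The function name dot4x4_sse1 is hypothetical and this is just one of several shuffle sequences that would work. As before, a and b each point to 16 floats.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE1 only */

static void dot4x4_sse1(const float *a, const float *b, float out[4])
{
    assert(a && b && out);
    __m128 p0 = _mm_mul_ps(_mm_loadu_ps(a +  0), _mm_loadu_ps(b +  0));
    __m128 p1 = _mm_mul_ps(_mm_loadu_ps(a +  4), _mm_loadu_ps(b +  4));
    __m128 p2 = _mm_mul_ps(_mm_loadu_ps(a +  8), _mm_loadu_ps(b +  8));
    __m128 p3 = _mm_mul_ps(_mm_loadu_ps(a + 12), _mm_loadu_ps(b + 12));

    /* unpacklo(p0,p1) = p0[0], p1[0], p0[1], p1[1]
       unpackhi(p0,p1) = p0[2], p1[2], p0[3], p1[3]
       so s01 = p0[0]+p0[2], p1[0]+p1[2], p0[1]+p0[3], p1[1]+p1[3] */
    __m128 s01 = _mm_add_ps(_mm_unpacklo_ps(p0, p1), _mm_unpackhi_ps(p0, p1));
    __m128 s23 = _mm_add_ps(_mm_unpacklo_ps(p2, p3), _mm_unpackhi_ps(p2, p3));

    /* Gather the low halves and the high halves of s01 and s23. */
    __m128 lo = _mm_shuffle_ps(s01, s23, _MM_SHUFFLE(1, 0, 1, 0));  /* s01[0], s01[1], s23[0], s23[1] */
    __m128 hi = _mm_shuffle_ps(s01, s23, _MM_SHUFFLE(3, 2, 3, 2));  /* s01[2], s01[3], s23[2], s23[3] */

    /* lo + hi: element k is the full sum of product vector k. */
    _mm_storeu_ps(out, _mm_add_ps(lo, hi));
}
```

The additions happen in a different order than with dpps or hadd, so for general floating-point inputs the last bits of the result may differ.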

Tags: c, sse, intrinsics