我应该何时以及如何在我的 simd 例程中执行浮点转换？

Question

我正在计算 2 个图像的双向（水平和垂直）前缀和（扫描），产生像素和、平方和以及两个图像的叉积。所有的计算都是在 32 位整数中完成的，然后我进入最终阶段，需要将 32 位整数转换为双精度数以计算窗口函数中两个图像的均值、方差和协方差。

首先 -- 这是执行此操作的最佳方法吗？我可以在双精度中构建整个前缀和数组，并且没有转换步骤。

其次——如果这是正确的方法，使用打包的双 simd 操作是否对我有很多好处？我只能安全地假设我会一次拿出 2 个单位。

第三 -- 我应该将数据单元打包在一起还是保留其当前所在的平面格式？ [平面格式是像素被 'component' 分割的格式。如果您获得 32 位 RGBA 输入，即 8 位 R、8 位 G、8 位 B 和 8 位 A，则打包格式为 RGBARGBA，而平面格式为 RRRRRRRRRRRRR....GGGGGGGGGGG....BBBBB.. ...AAAAA...等等。]

以下是我到目前为止完成的与此主题相关的三个功能。前 2 个是标量版本，因此更容易阅读和理解正在发生的事情。第三个是功能 1 的当前 SIMD 实现。第四个功能（缺少且尚未完成）是这个问题的主题，可能是第二个的 SIMD 实现。

std::unique_ptr<uint32_t[],boost::alignment::aligned_delete> computeSumMatrixForwardScalar2PassAll(uint8_t const* pImgData1, uint8_t const* pImgData2,
                                                                                                unsigned width, unsigned height)
{
  using namespace simdpp;
  std::unique_ptr<uint32_t[], boost::alignment::aligned_delete> sumArray((uint32_t*)boost::alignment::aligned_alloc(64, 5*width*height*sizeof(uint32_t)));
  auto pSumArray = sumArray.get();
  BOOST_ALIGN_ASSUME_ALIGNED(pImgData1, 64);
  BOOST_ALIGN_ASSUME_ALIGNED(pImgData2, 64);
  BOOST_ALIGN_ASSUME_ALIGNED(pSumArray, 64);
//#pramga omp parallel for private(h) shared(pImgData, pSumArray, w )

#pragma omp for simd
  for (unsigned h = 0; h < height; ++h)
  {
    uint32_t lastValX = 0;
    uint32_t lastValY = 0;
    uint32_t lastValXX = 0;
    uint32_t lastValYY = 0;
    uint32_t lastValXY = 0;
    for (unsigned w = 0; w < width; ++w)
    {
      uint32_t imgValX      = pImgData1[h * width + w];
      uint32_t newValX      = lastValX + imgValX;
      uint32_t newValXX = lastValXX + imgValX*imgValX;
      uint32_t imgValY = pImgData2[h*width + w];
      uint32_t newValY = lastValY + imgValY;
      uint32_t newValYY = lastValYY + imgValY*imgValY;
      uint32_t newValXY = lastValXY + imgValX*imgValY;
      pSumArray[h*width + w]= newValX;
      pSumArray[width*height+h*width + w] = newValY;
      pSumArray[2*width*height+ h*width + w] = newValXX;

      pSumArray[3*width*height+h*width + w] = newValYY;
      pSumArray[4*width*height+h*width + w] = newValXY;
      lastValX              = newValX;
      lastValXX = newValXX;
      lastValY = newValY;
      lastValYY = newValYY;
      lastValXY = newValXY;
    }
  }
  for (unsigned i = 0; i < 5; ++i) {
    for (unsigned h = 0; h+1 < height; ++h)
    {
      for (unsigned w = 0; w < width; ++w) {
        uint32_t above = pSumArray[i*width*height + h * width + w];
        uint32_t current = pSumArray[i*width*height+ (h+1) *width +w];
        pSumArray[i*width*height + (h+1) * width +w]= above+current;
      }
    }
  }

  return sumArray;
}

第二：SSIM 转换——注意不同的语言——因为我还没有完成它的 C++ 实现。注意，它在里面调用了 weberSumMatrix，和上面的函数是一样的。

export function weberSsim(
  pixels1: ImageMatrix,
  pixels2: ImageMatrix,
  options: Options
): MSSIMMatrix {
  // console.time("weber ssim");
  const { bitDepth, k1, k2, windowSize} = options
  const L = (1 << bitDepth) - 1
  const c1 = k1 * L * (k1 * L)
  const c2 = k2 * L * (k2 * L)
  const windowSquared = windowSize * windowSize
  const pixels1Data = pixels1.data;
  const pixels2Data = pixels2.data;
  const width = pixels1.width;
  const height = pixels1.height;
  // Produces exactly the same output as the C++ prefix sum above.
  const sumMatrix = weberSumMatrix(pixels1Data, pixels2Data, width, height);
  const windowHeight = height-windowSize;
  const windowWidth = width-windowSize;
  const imageSize = width*height;
  const ssims = new Array(windowHeight*windowWidth);


  // lets handle w = 0 h = 0 first and initialize mssim

  let cumulativeSsim;
  const reciprocalWindowSquared =  1 / windowSquared;
  {
    const windowOffset = windowSize - 1;
    let bottomOffset = windowOffset*width;
    {
      const meanx = (sumMatrix[bottomOffset+ windowOffset]) * reciprocalWindowSquared;
      const meany = (
        sumMatrix[imageSize + bottomOffset+ windowOffset]) * reciprocalWindowSquared;
      const varx = (
        sumMatrix[2*imageSize + bottomOffset+ windowOffset]) * reciprocalWindowSquared - meanx*meanx ;
      const vary = (
        sumMatrix[3*imageSize + bottomOffset+ windowOffset])  * reciprocalWindowSquared - meany*meany;
      const cov = (
        sumMatrix[4*imageSize + bottomOffset+ windowOffset])  * reciprocalWindowSquared - meanx*meany;
      const na = 2 * meanx * meany + c1
      const nb = 2 * cov + c2
      const da = meanx * meanx + meany * meany + c1
      const db = varx + vary + c2
      const ssim = (na * nb) / (da * db)
      ssims[0] = ssim
      // mssim = ssim
      cumulativeSsim = ssim;
    }



    // next handle all of the h = 0, w > 0 cases first
    for (let w = 1; w <  windowWidth; ++w) {
      // in h =0 cases, there is no top left or top right
      let leftOffset = w - 1;
      const rightx = sumMatrix[bottomOffset+leftOffset];
      const leftx = sumMatrix[bottomOffset+(windowOffset+w)];
      const meanx = (leftx-rightx)* reciprocalWindowSquared;
      const righty= sumMatrix[imageSize + bottomOffset+ leftOffset];
      const lefty = sumMatrix[imageSize + bottomOffset+ (windowOffset+w)];
      const meany = (lefty-righty) * reciprocalWindowSquared;
      const rightxx = sumMatrix[2*imageSize + bottomOffset+leftOffset];
      const leftxx = sumMatrix[2*imageSize + bottomOffset+ (windowOffset+w)];
      const varx = (leftxx-rightxx) * reciprocalWindowSquared - meanx*meanx ;
      const rightyy = sumMatrix[3*imageSize + bottomOffset+leftOffset];
      const leftyy = sumMatrix[3*imageSize + bottomOffset+ (windowOffset+w)]
      const vary = (leftyy - rightyy)  * reciprocalWindowSquared - meany*meany;
      const rightxy = sumMatrix[4*imageSize + bottomOffset+leftOffset];
      const leftxy = sumMatrix[4*imageSize + bottomOffset+ (windowOffset+w)];
      const cov = (leftxy-rightxy)  * reciprocalWindowSquared - meanx*meany;
      const na = 2 * meanx * meany + c1
      const nb = 2 * cov + c2
      const da = meanx * meanx + meany * meany + c1
      const db = varx + vary + c2
      const ssim = (na * nb) / (da *db)
      ssims[w] = ssim
      // mssim = mssim + (ssim - mssim) / (i + 1)
      cumulativeSsim += ssim;
    }
  }

  const windowOffset = windowSize - 1;
  // There will be lots of branch misses if we don't split the w==0 and h==0 cases
  for (let h = 1; h < windowHeight; ++h) {
    // now the w=0 on each line
    let bottomOffset = (h+windowSize-1)*width;
    let topOffset = (h-1)*width;
    {
      // since there is no left side we can skip two operations
      const topx = sumMatrix[topOffset+ windowOffset];
      const bottomx = sumMatrix[bottomOffset+ windowOffset];
      const meanx = (bottomx - topx) * reciprocalWindowSquared;
      const topy = sumMatrix[imageSize + topOffset+ windowOffset];
      const bottomy = sumMatrix[imageSize + bottomOffset+ windowOffset];
      const meany = (bottomy - topy) * reciprocalWindowSquared;
      const topxx = sumMatrix[2*imageSize + topOffset+ windowOffset];
      const bottomxx = sumMatrix[2*imageSize + bottomOffset+ windowOffset];
      const varx = (bottomxx-topxx)  * reciprocalWindowSquared - meanx*meanx ;
      const topyy = sumMatrix[3*imageSize + topOffset+ windowOffset];
      const bottomyy = sumMatrix[3*imageSize + bottomOffset+ windowOffset];
      const vary = (bottomyy-topyy)  * reciprocalWindowSquared - meany*meany;
      const topxy = sumMatrix[4*imageSize + topOffset+ windowOffset];
      const bottomxy = sumMatrix[4*imageSize + bottomOffset+ windowOffset];
      const cov = (bottomxy-topxy)  * reciprocalWindowSquared - meanx*meany;
      const na = 2 * meanx * meany + c1
      const nb = 2 * cov + c2
      const da = meanx * meanx + meany * meany + c1
      const db = varx + vary + c2
      const ssim = (na * nb) / (da *db)
      ssims[h*windowWidth] = ssim
      // mssim = mssim + (ssim - mssim) / (i + 1)
      cumulativeSsim += ssim;
    }


    for (let w = 1; w < windowWidth; ++w) {
      // add top left sub top right sub bottom left add bottom right
      const rightOffset = w + windowSize - 1;
      const leftOffset = w - 1;
      const meanx = (sumMatrix[topOffset + leftOffset]
        - sumMatrix[topOffset+ rightOffset]
        - sumMatrix[bottomOffset+leftOffset]
        + sumMatrix[bottomOffset+ rightOffset]) * reciprocalWindowSquared;
      const meany = (sumMatrix[imageSize+ topOffset + leftOffset]
        - sumMatrix[imageSize + topOffset+ rightOffset]
        - sumMatrix[imageSize + bottomOffset+leftOffset]
        + sumMatrix[imageSize + bottomOffset+ rightOffset]) * reciprocalWindowSquared;
      const varx = (sumMatrix[2*imageSize+ topOffset + leftOffset]
        - sumMatrix[2*imageSize + topOffset+ rightOffset]
        - sumMatrix[2*imageSize + bottomOffset+leftOffset]
        + sumMatrix[2*imageSize + bottomOffset+ rightOffset]) * reciprocalWindowSquared - meanx*meanx ;
      const vary = (sumMatrix[3*imageSize+ topOffset + leftOffset]
        - sumMatrix[3*imageSize + topOffset+ rightOffset]
        - sumMatrix[3*imageSize + bottomOffset+leftOffset]
        + sumMatrix[3*imageSize + bottomOffset+ rightOffset])  * reciprocalWindowSquared - meany*meany;
      const cov = (sumMatrix[4*imageSize+ topOffset + leftOffset]
        - sumMatrix[4*imageSize + topOffset+ rightOffset]
        - sumMatrix[4*imageSize + bottomOffset+leftOffset]
        + sumMatrix[4*imageSize + bottomOffset+ rightOffset])  * reciprocalWindowSquared - meanx*meany;
      const na = 2 * meanx * meany + c1
      const nb = 2 * cov + c2
      const da = meanx * meanx + meany * meany + c1
      const db = varx + vary + c2
      const ssim = (na * nb) / (da * db)
      ssims[h*windowWidth+w] = ssim
      cumulativeSsim += ssim;
      // mssim = mssim + (ssim - mssim) / (i + 1)
    }
  }
  const mssim = cumulativeSsim / (windowHeight*windowWidth);


  return { data: ssims, width, height, mssim }
}

第三：当前SIMD前缀总和。

std::unique_ptr<uint32_t[],boost::alignment::aligned_delete> computeSumMatrixForwardSimd2PassAll(uint8_t const* pImgData1, uint8_t const* pImgData2,
                                                                                              unsigned width, unsigned height)
{
  using namespace simdpp;
  std::unique_ptr<uint32_t[], boost::alignment::aligned_delete> sumArray((uint32_t*)boost::alignment::aligned_alloc(64, 5*width*height*sizeof(uint32_t)));
  auto pSumArray = sumArray.get();
  BOOST_ALIGN_ASSUME_ALIGNED(pImgData1, 64);
  BOOST_ALIGN_ASSUME_ALIGNED(pImgData2, 64);
  BOOST_ALIGN_ASSUME_ALIGNED(pSumArray, 64);
//#pramga omp parallel for private(h) shared(pImgData, pSumArray, w )


  uint32x4 zero = make_zero();
  for (unsigned h = 0; h < height; ++h)
  {
    uint32x4 lastValSplatX = zero;
    uint32x4 lastValSplatY = zero;
    uint32x4 lastValSplatXX = zero;
    uint32x4 lastValSplatYY = zero;
    uint32x4 lastValSplatXY = zero;
    for (unsigned w = 0; w < width; w += 16)
    {
      // starting left value
      // previous line values..
      prefetch_read(pImgData1+(w+1)*64);
      prefetch_read(pImgData2+(w+1)*64);
      uint32v4 imgDataX = to_uint32(uint8x16(load(pImgData1 + h * width + w)));
      uint32v4 imgDataY = to_uint32(uint8x16(load(pImgData2 + h * width + w)));
      static_assert(uint32v4::vec_length == 4);
      static_assert(sizeof(uint32v4::base_vector_type::native_type) == 16);
      for (unsigned i = 0 ; i < uint32v4::vec_length; ++i) {
        // a_0 a_1 a_2 a_3
        uint32v4::base_vector_type   x = imgDataX.vec(i);
        uint32v4::base_vector_type  y = imgDataY.vec(i);
        uint32v4::base_vector_type xx = mul_lo(x,x);
        uint32v4::base_vector_type yy = mul_lo(y, y);
        uint32v4::base_vector_type xy = mul_lo(x, y);
        // a_0 a_0+a_1 a_1+a_2 a_2+a_3
        x = add(x, move4_r<1>(x));
        x = add(x, move4_r<2>(x));
        x = add(x, lastValSplatX);
        lastValSplatX = permute4<3,3,3,3>(x);
        store(pSumArray+h*width+w+i*4, x);
        y = add(y, move4_r<1>(y));
        y = add(y, move4_r<2>(y));
        y = add(y, lastValSplatY);
        lastValSplatY = permute4<3,3,3,3>(y);
        store(width*height+pSumArray+h*width+w+i*4, y);
        xx = add(xx, move4_r<1>(xx));
        xx = add(xx, move4_r<2>(xx));
        xx = add(xx, lastValSplatXX);
        lastValSplatXX = permute4<3,3,3,3>(xx);
        store(2*width*height+pSumArray+h*width+w+i*4, xx);
        yy = add(yy, move4_r<1>(yy));
        yy = add(yy, move4_r<2>(yy));
        yy = add(yy, lastValSplatYY);
        lastValSplatYY = permute4<3,3,3,3>(yy);
        store(3*width*height+pSumArray+h*width+w+i*4, yy);
        xy = add(xy, move4_r<1>(xy));
        xy = add(xy, move4_r<2>(xy));
        xy = add(xy, lastValSplatXY);
        lastValSplatXY = permute4<3,3,3,3>(xy);
        store(4*width*height+pSumArray+h*width+w+i*4, xy);
      }
    }
  }
  // 16 bit 8s for grins...
  // a_0 a_1 a_2 a_3 a_4 a_5 a_6 a_7
  // a_0 a_0+a_1 a_1+a_2 a_2+a_3 a_3+a_4 a_4+a_5 a_5+a_6 a_6+a_7 (>>1)
  // d    d        -a_0  -a_0+a_1 -a_0+a_1+a_2 -a_0+a_1+a_2+a_3 -a_0+a_1+a_2+a_3+a_4 - -a_0+a_1+a_2+a_3+a_4+a_5
  // d    d        shuffle shuffle shuffle+add    shuffle+add     shuffle+add+shuffle+add  shuffle+add+shuffle+add


  // a_0 a_1 a_2 a_3     a_4         a_5           a_6                    a_7
  //         a_0 a_1     a_2         a_3           a_4                    a_5
  //         a_1 a_2+a_0 a_3+a_1-a_0 a_4+a_2-a1-a0 a_5+a_3-a_2-a_0        a_6+a_4...
  for (unsigned i = 0; i < 5; ++i) {
    for (unsigned h = 0; h+1 < height; ++h)
    {
      for (unsigned w = 0; w < width; w += 16) {
        uint32v4 above = load(i*width*height +pSumArray + h * width + w);
        uint32v4 current = load(i*width*height+pSumArray +(h+1) * width +w);
        store(i*width*height+pSumArray +(h+1) * width +w, add(above,current));
      }
    }
  }

  return sumArray;
}

Answer 1

对于第 3 部分——至少，英特尔® 64 位和 IA-32 架构优化参考手册指定平面格式通常更为理想。所以我可以越过 one-off。看起来我可以做 floating-point 值而不是双倍 windows 高达 16x16 (255*256**2) 因为它在溢出之前是可表示的。从这个角度来看，我得到了每个向量 4 个浮点数而不是 2 个双精度浮点数的好处。这很有帮助。看来我可以从四个浮点数中得到我的平均计算的滚动总和并在最后计算它。因此，如果我根据最大 window 大小限制函数，我可以确定何时使用 float 以及何时使用 double。实现并没有太大的不同。

我想在我尝试看看之前我不会知道它的真正性能。

Answer 2

First -- is this the optimal way to do this? I could build the entire prefix sum array in doubles and there would be no conversion step.

通常，整数计算比浮点计算快得多，每个向量的元素数量相同。例如，在 Skylake 上，padd 的延迟为 1，相互吞吐量为 0.33 个周期，addps 分别为 4 和 0.5。对于整数计算，如果您有乘法，有时会因为必须在不同的元素大小之间进行转换或合并乘积的上半部分和下半部分而产生开销。但通常这仍然比 FP 更快，如果您能够合并 pmaddwd 或 pmaddubsw 指令（FMA 的 INT 版本），您也可以减少它，或者您可以丢弃乘积的下半部分或上半部分。

说到 FMA，FP 因为 AVX2 具有 FMA 指令的优势，它允许每次乘法免费进行一次加法。这对你的情况是否有益取决于你的算法和输入数据，但如果你有整数输入，我仍然希望尽可能地对其进行 INT 计算。

INT 计算相对于 FP 的另一个优势是您可以对更小的元素进行计算，这意味着您可以每条指令处理更多数据。当然，这只有在你的输入和算法允许的情况下才有可能，但特别是在图像处理中，这种情况经常发生。

关于这部分的最后一点注意事项是，您应该考虑在算法的每个阶段需要处理的数据量。 INT 到 FP 的转换不是免费的，因此您需要转换的数据越少越好。如果您的中间数据量少于输入量，那么推迟转换将是有益的。

Second -- if this is the right way, do I get many benefits from using the packed double simd operations? I can only safely assume I'll get 2 units out at a time.

好吧，每条指令 2 个双倍比一个好，所以在我看来这是值得的。与标量相比，您还可以对向量执行更精细的操作，例如 FMA、掩码、混合 min/max（尽管编译器甚至可以在标量代码上为您生成一些指令）。如果您在运行时检测到 AVX，您可以机会性地将吞吐量加倍。

此外，您应该考虑 32 位浮点精度在您的情况下是否足够。您可能不需要双精度结果，并且通过使用一些 FP 技术，您可以减少使用 32 位 FP 数学累积的错误，并且仍然具有比 64 位计算更好的性能。

Third -- Should I pack the data units together or leave it in the planar format it's currently in?

一般来说，您应该更喜欢平面输入数据。一般而言，SIMD 尤其是 SSE/AVX 更适合垂直操作（即当操作在不同向量的相应元素之间执行时）而不是水平操作（当操作在同一向量的元素之间执行时）。对于打包输入，您可能必须对输入数据执行解包和混洗，这会增加开销。现代 CPU 能够跟踪多个读取或写入流 from/to 内存，因此硬件预取器应该能够处理不同数据平面的线性内存访问。

Answer 3

关于第 1 部分，根据整数的计数及其范围，您可能更适合使用 64 位整数而不是双精度数。一般来说，特别是延迟要快一点，也更精确。这是SSE2中的一些片段。

// Return acc64 += a32. The a32 is viewed as uint32_t lanes, the accumulator is uint64_t
inline __m128i integerAdd( __m128i a32, __m128i acc64 )
{
    const __m128i low = _mm_and_si128( a32, _mm_set1_epi64x( UINT_MAX ) );
    acc64 = _mm_add_epi64( acc64, low );
    const __m128i high = _mm_srli_epi64( a32, 32 );
    acc64 = _mm_add_epi64( acc64, high );
    return acc64;
}

// Compute a32 * b32, add to the accumulator. The two inputs are viewed as uint32_t lanes, the accumulator is uint64_t
inline __m128i integerFma( __m128i a32, __m128i b32, __m128i acc64 )
{
    const __m128i low = _mm_mul_epu32( a32, b32 );
    a32 = _mm_srli_si128( a32, 4 );
    b32 = _mm_srli_si128( b32, 4 );
    const __m128i high = _mm_mul_epu32( a32, b32 );
    acc64 = _mm_add_epi64( acc64, low );
    acc64 = _mm_add_epi64( acc64, high );
    return acc64;
}

// Add both 64-bit lanes of the accumulator, convert to double
inline double accumulatedValue( __m128i acc64 )
{
    acc64 = _mm_add_epi64( acc64, _mm_unpackhi_epi64( acc64, acc64 ) );
    const uint64_t v = (uint64_t)_mm_cvtsi128_si64( acc64 );
    return (double)v;
}

更新：平均值为 sum(x) / count。你不需要分数来计算整数和，如果你有很多值，你只需要 64 位整数来避免溢出。一条最终指令只需要 non-integer 个数字。

对于您需要的方差 sum((x-mean)^2)。这可以是 re-written 作为 sum(x^2) + mean * ( mean - 2 * sum(x) )。现在很明显，您只需要对输入数据进行一次传递，计算平方和和值的和。如果输入太多且太大，计算平方和可能会溢出 64 位整数。但对于许多实际应用来说，这很好，例如如果您的输入在 [0 .. 65535] 范围内，64 位整数仅在 40 亿个样本后溢出，您可能没有那么多。

我应该何时以及如何在我的 simd 例程中执行浮点转换？

When and how should I perform floating point conversion in my simd routine?

x86

assembly

sse

simd

webassembly