Why is `_mm_stream_si128` much slower than `_mm_storeu_si128` on Skylake-Xeon when writing parts of 2 cache lines? But less effect on Haswell

Question

My code looks like this (a simple load, modify, store). I have simplified it to make it more readable:

__asm__ __volatile__ ( "vzeroupper" : : : );
while(...) {
  __m128i in = _mm_loadu_si128(inptr);
  __m128i out = in; // real code does more than this, but I've simplified it
  _mm_stream_si128(outptr,out);
  inptr  += 12;
  outptr += 16;
}

This code runs about 5x faster on our older ~~Sandy Bridge~~ Haswell hardware than on our newer Skylake machines. For example, if the while loop runs about 16e9 iterations, it takes 14 seconds on ~~Sandy Bridge~~ Haswell and 70 seconds on Skylake.

We upgraded to the latest microcode on the Skylake machines, and also stuck in vzeroupper commands to avoid any AVX issues. Neither fix had any effect.

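outptr is aligned to 16 bytes, so the stream command should write to aligned addresses. (I put in checks to verify this statement.) inptr is unaligned by design. Commenting out the loads has no effect; the limiting commands are the stores. outptr and inptr point to different memory regions, with no overlap.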
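The alignment check mentioned above can be as small as the following sketch. The helper name `is_aligned` is hypothetical (it does not appear in the original code), and it assumes the requested alignment is a power of two.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of the original benchmark.
// Returns true when p is a multiple of `alignment` bytes.
// Assumes `alignment` is a power of two (16, 32, 64, ...).
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

For instance, `assert(is_aligned(outptr, 16));` ahead of each streaming store would catch a misaligned destination in a debug build. Passing a check for 16 says nothing about 64-byte (cache-line) alignment, which needs its own check with 64.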

If I replace _mm_stream_si128 with _mm_storeu_si128, the code runs much faster on both machines, in about 2.9 seconds.

So the two questions are:

1) Why is there such a big difference between ~~Sandy Bridge~~ Haswell and Skylake when writing with the _mm_stream_si128 intrinsic?

2) Why does _mm_storeu_si128 run 5x faster than the streaming equivalent?

I'm a newbie when it comes to intrinsics.

Addendum - test case

Here is the entire test case: https://godbolt.org/z/toM2lB

Here is a summary of the benchmarks I took on two different processors, E5-2680 v3 (Haswell) and 8180 (Skylake).

// icpc -std=c++14  -msse4.2 -O3 -DNDEBUG ../mre.cpp  -o mre
// The following benchmark times were observed on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
// and Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
// The command line was
//    perf stat ./mre 100000
//
//   STORER               time (seconds)
//                     E5-2680   8180
// ---------------------------------------------------
//   _mm_stream_si128     1.65   7.29
//   _mm_storeu_si128     0.41   0.40

The ratio of stream to store is 4x or 18x, respectively.

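I'm relying on the default new allocator to align my data to 16 bytes. I'm getting lucky here that it is aligned. I have tested that this is true, and in my production application I use an aligned allocator to make absolutely sure it is, and I also check the address, but I left that out of the example because I didn't think it mattered.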

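Second edit - 64B aligned output

The comment from @Mystical made me check whether the outputs were all cache-line aligned. The writes to the Tile structures are done in 64-B chunks, but the Tiles themselves are not 64-B aligned (only 16-B aligned).

So I changed my test code like this: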

#if 0
    std::vector<Tile> tiles(outputPixels/32);
#else
    std::vector<Tile, boost::alignment::aligned_allocator<Tile,64>> tiles(outputPixels/32);
#endif

Now the numbers are quite different:

//   STORER               time (seconds)
//                     E5-2680   8180
// ---------------------------------------------------
//   _mm_stream_si128     0.19   0.48
//   _mm_storeu_si128     0.25   0.52

So everything is much faster. But Skylake is still about 2x slower than Haswell.

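Third edit - purposely misaligned

I tried the test suggested by @HaidBrais. I purposely allocated my vector class aligned to 64 bytes, then added 16 bytes or 32 bytes inside the allocator so that the allocation was 16-byte or 32-byte aligned, but NOT 64-byte aligned. I also increased the number of loops to 1,000,000, ran the test 3 times, and picked the smallest time.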

perf stat ./mre1  1000000

To reiterate, an alignment of 2^N means the address is NOT aligned to 2^(N+1) or 2^(N+2).
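Put differently, the 16 and 32 values in the alignment column of the table that follows are the largest power of two dividing the start address. A minimal sketch of that definition, using a hypothetical helper name `exact_alignment` that is not in the original test:

```cpp
#include <cstdint>

// Hypothetical helper, not part of the original benchmark.
// Returns the largest power of two that divides addr, i.e. the "exact"
// alignment in the sense used above. Returns 0 when addr is 0.
inline std::uintptr_t exact_alignment(std::uintptr_t addr) {
    return addr & (~addr + 1);  // isolate the lowest set bit
}
```

With this definition, an address such as 0x1010 has alignment 16 and 0x1020 has alignment 32; neither is 64-byte aligned.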

//   STORER               alignment time (seconds)
//                        byte  E5-2680   8180
// ---------------------------------------------------
//   _mm_storeu_si128     16       3.15   2.69
//   _mm_storeu_si128     32       3.16   2.60
//   _mm_storeu_si128     64       1.72   1.71
//   _mm_stream_si128     16      14.31  72.14 
//   _mm_stream_si128     32      14.44  72.09 
//   _mm_stream_si128     64       1.43   3.38

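So it is clear that cache-line alignment gives the best results, but _mm_stream_si128 is better only on the 2680 processor and suffers some sort of penalty on the 8180 that I can't explain.

For future use, here is the misaligned allocator I used (I did not templatize the misalignment; you'll have to edit the 32 and change it to 0 or 16 as needed):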

#include <cstddef>  // std::size_t
#include <cstdint>  // uint8_t
#include <cstdlib>  // aligned_alloc, std::free
#include <new>      // std::bad_alloc

template <class T>
struct Mallocator {
  typedef T value_type;
  Mallocator() = default;
  template <class U> constexpr Mallocator(const Mallocator<U>&) noexcept {}
  T* allocate(std::size_t n) {
    if (n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
    // One extra element leaves room for the 32-byte shift (assumes sizeof(T) >= 32).
    uint8_t* p1 = static_cast<uint8_t*>(aligned_alloc(64, (n + 1) * sizeof(T)));
    if (!p1) throw std::bad_alloc();
    p1 += 32; // misalign on purpose
    return reinterpret_cast<T*>(p1);
  }
  void deallocate(T* p, std::size_t) noexcept {
    uint8_t* p1 = reinterpret_cast<uint8_t*>(p);
    p1 -= 32; // undo the shift to recover the pointer aligned_alloc returned
    std::free(p1);
  }
};
template <class T, class U>
bool operator==(const Mallocator<T>&, const Mallocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Mallocator<T>&, const Mallocator<U>&) { return false; }

...

std::vector<Tile, Mallocator<Tile>> tiles(outputPixels/32);

Answer 1

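The simplified code doesn't really show the actual structure of your benchmark. I don't think the simplified code will exhibit the slowness you've mentioned.

The actual loop from your godbolt code is: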

        while (count > 0)
        {
            // std::cout << std::hex << (void*) ptr << " " << (void*) tile <<std::endl;
            __m128i value0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 0 * diffBytes));
            __m128i value1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 1 * diffBytes));
            __m128i value2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 2 * diffBytes));
            __m128i value3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 3 * diffBytes));

            __m128i tileVal0 = value0;
            __m128i tileVal1 = value1;
            __m128i tileVal2 = value2;
            __m128i tileVal3 = value3;

            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 0), tileVal0);
            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 1), tileVal1);
            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 2), tileVal2);
            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 3), tileVal3);

            ptr    += diffBytes * 4;
            count  -= diffBytes * 4;
            tile   += diffPixels * 4;
            ipixel += diffPixels * 4;
            if (ipixel == 32)
            {
                // go to next tile
                ipixel = 0;
                tileIter++;
                tile = reinterpret_cast<uint16_t*>(tileIter->pixels);
            }
        }

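Note the if (ipixel == 32) part. This jumps to a different tile every time ipixel reaches 32. Since diffPixels is 8, ipixel advances by 32 per iteration, so this happens every iteration. Hence you are only making 4 streaming stores (64 bytes) per tile. Unless each tile happens to be 64-byte aligned, which is unlikely to happen by chance and cannot be relied on, this means that every 64-byte write covers only part of two different cache lines. That's a known antipattern for streaming stores: to use streaming stores effectively you need to write out the full line.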
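The arithmetic behind "part of two different cache lines" can be checked with a short sketch. The helper names and the `kCacheLine` constant are hypothetical, and a 64-byte line size is assumed.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

// Hypothetical helper: number of cache lines that the byte range
// [addr, addr + len) overlaps. Assumes len > 0.
inline std::size_t lines_touched(std::uintptr_t addr, std::size_t len) {
    std::uintptr_t first = addr / kCacheLine;
    std::uintptr_t last  = (addr + len - 1) / kCacheLine;
    return static_cast<std::size_t>(last - first + 1);
}

// Hypothetical helper: number of cache lines the range covers completely.
inline std::size_t full_lines_written(std::uintptr_t addr, std::size_t len) {
    std::uintptr_t first_full = (addr + kCacheLine - 1) / kCacheLine;  // round up
    std::uintptr_t end_full   = (addr + len) / kCacheLine;             // round down
    return end_full > first_full
               ? static_cast<std::size_t>(end_full - first_full)
               : 0;
}
```

A 64-byte tile that starts on a 64-byte boundary touches one line and fills it completely. The same tile starting 16 or 32 bytes past a boundary touches two lines and fills neither, which is the case the benchmark hits when the tiles are only 16-byte aligned.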

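On to the performance differences: streaming stores have widely varying performance on different hardware. These stores always occupy a line fill buffer (LFB) for some time, but how long varies: on many client chips a store seems to occupy a buffer for only about the L3 latency. That is, once the streaming store reaches the L3 it can be handed off (the L3 will track the rest of the work) and the LFB can be freed on the core. Server chips often have a much longer latency, especially multi-socket hosts.

Evidently, the performance of NT stores is worse on the SKX box, and much worse for partial-line writes. The overall worse performance is probably related to the redesign of the L3 cache.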
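One way to act on this, sketched under assumptions rather than taken from the benchmark: use streaming stores only when the four 16-byte stores fill exactly one cache line, and fall back to regular stores otherwise. The function name `store_tile` is hypothetical. Whether streaming stores win even in the aligned case depends on the machine, as the numbers in the question show, so this should be measured rather than assumed.

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_storeu_si128
#include <cstdint>

// Hypothetical helper, not from the original benchmark: writes one 64-byte
// tile (four 16-byte vectors) to dst.
// Assumption: streaming stores are used only when dst is 64-byte aligned,
// so that the four stores together fill exactly one cache line.
inline void store_tile(void* dst, const __m128i* v) {
    __m128i* out = static_cast<__m128i*>(dst);
    if ((reinterpret_cast<std::uintptr_t>(dst) & 63) == 0) {
        // Full-line case: non-temporal stores (require 16-byte alignment).
        for (int i = 0; i < 4; ++i) _mm_stream_si128(out + i, v[i]);
    } else {
        // Partial-line case: ordinary unaligned stores through the cache.
        for (int i = 0; i < 4; ++i) _mm_storeu_si128(out + i, v[i]);
    }
}
```

Because non-temporal stores are weakly ordered, a single `_mm_sfence()` after the store loop is needed before another thread is told that the data is ready.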

performance

x86

sse

intel

intrinsics