Why is a Core i5-6600 faster at non-square matrix multiplication than a Core i9-9960X?

Question

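The following minimal benchmark rebuilds single-threaded code with -O3 -march=native on each of two machines and multiplies matrices that are either square or highly non-square (one dimension = 2).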

#include <Eigen/Core>

#include <chrono>
#include <iomanip>
#include <iostream>
#include <string>

std::string show_shape(const Eigen::MatrixXf& m)
{
    return "(" + std::to_string(m.rows()) + ", " + std::to_string(m.cols()) + ")";
}

void measure_gemm(const Eigen::MatrixXf& a, const Eigen::MatrixXf& b)
{
    typedef std::chrono::high_resolution_clock clock;
    const auto start_time_ns = clock::now().time_since_epoch().count();
    const std::size_t runs = 10;
    for (size_t i = 0; i < runs; ++i)
    {
        Eigen::MatrixXf c = a * b;
    }
    const auto end_time_ns = clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
    std::cout << std::setw(5) << elapsed_ms <<
        " ms <- " << show_shape(a) + " * " + show_shape(b) << std::endl;
}

int main()
{
    measure_gemm(Eigen::MatrixXf::Zero(2, 4096), Eigen::MatrixXf::Zero(4096, 16384));
    measure_gemm(Eigen::MatrixXf::Zero(1536, 1536), Eigen::MatrixXf::Zero(1536, 1536));
    measure_gemm(Eigen::MatrixXf::Zero(16384, 4096), Eigen::MatrixXf::Zero(4096, 2));
}

It can easily be run with the following Dockerfile:

FROM ubuntu:20.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update
RUN apt-get install -y build-essential wget cmake git lshw

RUN git clone -b '3.3.7' --single-branch --depth 1 https://github.com/eigenteam/eigen-git-mirror && cd eigen-git-mirror && mkdir -p build && cd build && cmake .. && make && make install && ln -s /usr/local/include/eigen3/Eigen /usr/local/include/Eigen

#ADD wide_vs_tall.cpp .
RUN wget https://gist.githubusercontent.com/Dobiasd/78b32fd4aa2fc83d8da3935d690c623a/raw/5626198a533473157d6a19a824f20ebe8678e9cf/wide_vs_tall.cpp
RUN g++ -std=c++14 -O3 -march=native wide_vs_tall.cpp -o main

ADD "https://www.random.org/cgi-bin/randbyte?nbytes=10&format=h" skipcache

RUN lscpu
RUN lshw -short -C memory

RUN ./main

wget https://gist.githubusercontent.com/Dobiasd/8e27e5a96989fa8e4f942900fe609998/raw/8a07fee1a015c8c8e47066a7ac92891850b70a14/Dockerfile
docker build --rm .

This gives the following results:

Tobias's workstation (`Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz`)

  359 ms <- (2, 4096) * (4096, 16384)
  761 ms <- (1536, 1536) * (1536, 1536)
  597 ms <- (16384, 4096) * (4096, 2)

sysbench --cpu-max-prime=20000 --num-threads=1 cpu run

CPU speed:
    events per second:   491.14

Keith's workstation (`Intel(R) Core(TM) i9-9960X CPU @ 3.10GHz`)

  437 ms <- (2, 4096) * (4096, 16384)
  724 ms <- (1536, 1536) * (1536, 1536)
  789 ms <- (16384, 4096) * (4096, 2)

sysbench --cpu-max-prime=20000 --num-threads=1 cpu run

CPU speed:
    events per second:   591.58

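Why is Tobias's workstation faster at 2 of the 3 GEMMs than Keith's workstation, even though Keith's workstation shows better sysbench results? I would expect the i9-9960X to be faster, since its -march=native includes AVX512 and its single-core clock speed is higher.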

Answer 1

As was suggested, it seems to come down to memory throughput.

The results of mbw 1000 show:

i5-6600:

AVG Method: MEMCPY  Elapsed: 0.13856    MiB: 1000.00000 Copy: 7217.059 MiB/s
AVG Method: DUMB    Elapsed: 0.09008    MiB: 1000.00000 Copy: 11101.625 MiB/s

i9-9960X:

AVG Method: MEMCPY  Elapsed: 0.14682    MiB: 1000.00000 Copy: 6811.131 MiB/s
AVG Method: DUMB    Elapsed: 0.10475    MiB: 1000.00000 Copy: 9546.631 MiB/s
