Why is a Core i5-6600 faster at non-square matrix multiplication than a Core i9-9960X?
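The following minimal benchmark rebuilds single-threaded code with -O3 -march=native on each machine and multiplies matrices that are either square or highly non-square (one dimension = 2).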
#include <Eigen/Core>
#include <chrono>
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <string>

// Formats the dimensions of a matrix as "(rows, cols)".
std::string show_shape(const Eigen::MatrixXf& m)
{
    return "(" + std::to_string(m.rows()) + ", " + std::to_string(m.cols()) + ")";
}

// Times 10 single-precision products a * b and prints the total in milliseconds.
void measure_gemm(const Eigen::MatrixXf& a, const Eigen::MatrixXf& b)
{
    typedef std::chrono::high_resolution_clock clock;
    const auto start_time = clock::now();
    const std::size_t runs = 10;
    for (std::size_t i = 0; i < runs; ++i)
    {
        Eigen::MatrixXf c = a * b;
    }
    const auto end_time = clock::now();
    // duration_cast avoids assuming that the clock ticks in nanoseconds.
    const auto elapsed_ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();
    std::cout << std::setw(5) << elapsed_ms <<
        " ms <- " << show_shape(a) + " * " + show_shape(b) << std::endl;
}

int main()
{
    measure_gemm(Eigen::MatrixXf::Zero(2, 4096), Eigen::MatrixXf::Zero(4096, 16384));
    measure_gemm(Eigen::MatrixXf::Zero(1536, 1536), Eigen::MatrixXf::Zero(1536, 1536));
    measure_gemm(Eigen::MatrixXf::Zero(16384, 4096), Eigen::MatrixXf::Zero(4096, 2));
}
It can easily be run with this Dockerfile:
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update
RUN apt-get install -y build-essential wget cmake git lshw
RUN git clone -b '3.3.7' --single-branch --depth 1 https://github.com/eigenteam/eigen-git-mirror && cd eigen-git-mirror && mkdir -p build && cd build && cmake .. && make && make install && ln -s /usr/local/include/eigen3/Eigen /usr/local/include/Eigen
#ADD wide_vs_tall.cpp .
RUN wget https://gist.githubusercontent.com/Dobiasd/78b32fd4aa2fc83d8da3935d690c623a/raw/5626198a533473157d6a19a824f20ebe8678e9cf/wide_vs_tall.cpp
RUN g++ -std=c++14 -O3 -march=native wide_vs_tall.cpp -o main
ADD "https://www.random.org/cgi-bin/randbyte?nbytes=10&format=h" skipcache
RUN lscpu
RUN lshw -short -C memory
RUN ./main
wget https://gist.githubusercontent.com/Dobiasd/8e27e5a96989fa8e4f942900fe609998/raw/8a07fee1a015c8c8e47066a7ac92891850b70a14/Dockerfile
docker build --rm .
This produces the following results:
Tobias's workstation (Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz)
359 ms <- (2, 4096) * (4096, 16384)
761 ms <- (1536, 1536) * (1536, 1536)
597 ms <- (16384, 4096) * (4096, 2)
sysbench --cpu-max-prime=20000 --num-threads=1 cpu run
CPU speed:
events per second: 491.14
Keith's workstation (Intel(R) Core(TM) i9-9960X CPU @ 3.10GHz)
437 ms <- (2, 4096) * (4096, 16384)
724 ms <- (1536, 1536) * (1536, 1536)
789 ms <- (16384, 4096) * (4096, 2)
sysbench --cpu-max-prime=20000 --num-threads=1 cpu run
CPU speed:
events per second: 591.58
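Why is Tobias's workstation faster than Keith's in 2 of the 3 GEMMs, even though Keith's workstation shows the better sysbench result? I expected the i9-9960X to be faster, since its -march=native includes AVX512 and its single-core clock speed is higher.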
As suggested, it seems to come down to memory throughput.
The results of mbw 1000 show:
i5-6600:
AVG Method: MEMCPY Elapsed: 0.13856 MiB: 1000.00000 Copy: 7217.059 MiB/s
AVG Method: DUMB Elapsed: 0.09008 MiB: 1000.00000 Copy: 11101.625 MiB/s
i9-9960X:
AVG Method: MEMCPY Elapsed: 0.14682 MiB: 1000.00000 Copy: 6811.131 MiB/s
AVG Method: DUMB Elapsed: 0.10475 MiB: 1000.00000 Copy: 9546.631 MiB/s