为什么矩阵加法比 Eigen 中的矩阵向量乘法慢？

Question

为什么矩阵加法比矩阵向量乘法花费的时间长得多？

矩阵加法只需要 n^2 次加法，而矩阵向量乘法需要 n*(n-1) 次加法和 n^2 次乘法。

然而，在 Eigen 中，矩阵相加花费的时间是矩阵向量乘法。在 Eigen 中是否有任何选项可以加快 Matrix Add 操作？

#include <eigen3/Eigen/Eigen>
#include <iostream>
#include <ctime>
#include <string>
#include <chrono>
#include <fstream>
#include <random>
#include <iomanip>

using namespace Eigen;
using namespace std;

int main()
{
const int l=100;
MatrixXf m=MatrixXf::Random(l,l);
MatrixXf n=MatrixXf::Random(l,l);
VectorXf v=VectorXf::Random(l,1);

MatrixXf qq=MatrixXf::Random(l,1);
MatrixXf pp=MatrixXf::Random(l,l);

auto start = chrono::steady_clock::now();
for(int j=0;j<10000;j++)
qq=m*v;
auto end = chrono::steady_clock::now();
double time_duration=chrono::duration_cast<chrono::milliseconds>(end - start).count();
std::cout << setprecision(6) << "Elapsed time in seconds : "<< time_duration/1000<< "s" << std::endl;
auto start1 = chrono::steady_clock::now();
for(int j=0;j<10000;j++)
pp=m+n;
auto end1 = chrono::steady_clock::now();
double time_duration1=chrono::duration_cast<chrono::milliseconds>(end1 - start1).count();
std::cout << setprecision(6) << "Elapsed time in seconds : "<< time_duration1/1000<< "s" << std::endl;
}

测试 1：没有任何优化：

编译命令：g++-8 -test.cpp -o test

运行命令：./test

经过的时间（以秒为单位）：0.323 秒

经过的时间（秒）：0.635s

测试 2：使用 -march=native 优化：

g++-8 test.cpp -march=native -o test

运行命令：./test

经过的时间（秒）：0.21s

经过的时间（以秒为单位）：0.372 秒

测试 3：使用 -O3 优化：

编译命令：g++-8 -test.cpp -O3 -o test

运行命令：./test

经过的时间（秒）：0.009s

经过的时间（以秒为单位）：0.016 秒

测试 4：使用 -march=native，-O3 优化：

编译命令：g++-8 -test.cpp -march=native -O3 -o test

运行命令：./test

经过的时间（秒）：0.008s

经过的时间（以秒为单位）：0.016 秒

==============

我注意到编译器可能作弊的评论，因为我没有使用上一次迭代的结果。为了解决这个问题，我改为进行一次迭代并使用更大的尺寸来进行稳定的时间统计。

#include <eigen3/Eigen/Eigen>
#include <iostream>
#include <ctime>
#include <string>
#include <chrono>
#include <fstream>
#include <random>
#include <iomanip>

using namespace Eigen;
using namespace std;

int main()
{
const int l=1000;
MatrixXf m=MatrixXf::Random(l,l);
MatrixXf n=MatrixXf::Random(l,l);
VectorXf v=VectorXf::Random(l,1);

MatrixXf qq=MatrixXf::Random(l,1);
MatrixXf pp=MatrixXf::Random(l,l);

auto start = chrono::steady_clock::now();
qq=m*v;
auto end = chrono::steady_clock::now();
double time_duration=chrono::duration_cast<chrono::microseconds>(end - start).count();

auto start1 = chrono::steady_clock::now();
pp=m+n;
auto end1 = chrono::steady_clock::now();
double time_duration1=chrono::duration_cast<chrono::microseconds>(end1 - start1).count();
std::cout << setprecision(6) << "Elapsed time in microseconds : "<< time_duration<< "us" << std::endl;
std::cout << setprecision(6) << "Elapsed time in microseconds : "<< time_duration1<< "us" << std::endl;
}

测试 1：没有任何优化：

编译命令：g++-8 -test.cpp -o test

运行命令：./test

经过的时间（微秒）：3125us

经过的时间（以微秒为单位）：6849us

测试 2：使用 -march=native 优化：

g++-8 test.cpp -march=native -o test

运行命令：./test

经过的时间（以微秒为单位）：1776us

经过的时间（以微秒为单位）：3815us

测试 3：使用 -O3 优化：

编译命令：g++-8 -test.cpp -O3 -o test

运行命令：./test

经过的时间（以微秒为单位）：449us

经过的时间（以微秒为单位）：760us

测试 4：使用 -march=native，-O3 优化：

编译命令：g++-8 -test.cpp -march=native -O3 -o test

运行命令：./test

经过的时间（以微秒为单位）：351us

经过的时间（以微秒为单位）：871us

Answer 1

简短回答：您计算了操作的数量，但忽略了计算内存访问，对于加法情况，内存访问的成本高出近 x2 倍。详情如下。

首先，这两种运算的实际运算次数是相同的，因为现代CPUs能够同时执行一个独立的加法和乘法。两个连续的 mul/add 像 x*y+z 甚至可以融合为一个单一的运算，其成本与 1 次加法或 1 次乘法相同。如果您的 CPU 支持 FMA，那么 -march=native 就会发生这种情况，但我怀疑 FMA 在这里发挥任何作用。

其次，您在微积分中忘记测量内存访问次数。回想一下，除非数据已经在 L1 缓存中，否则一次内存加载比一次加法或一次乘法要昂贵得多。

添加很简单：我们有 2*n^2 个缓存未命中的负载，还有 n^2 个存储。

对于具有列主矩阵的矩阵向量乘积，输入向量只被读取一次，因此 n^2+n 加载输入，并且由于列由 4 列的块一次处理我们对输出向量进行了 n^2/4 次读写，但缓存未命中几乎为零，因为它适合 L1 缓存。所以总的来说，与矩阵向量乘积相比，加法的内存负载成本高出近 x2，因此 x2 速度因子并不异常。

此外，矩阵向量代码通过显式循环剥离进行了更积极的优化，但我怀疑这会对这个基准测试产生任何影响，因为你的矩阵根本不适合 L1 缓存。

为什么矩阵加法比 Eigen 中的矩阵向量乘法慢？

Why Matrix Addition is slower than Matrix-Vector Multiplication in Eigen?

c++

performance

matrix

eigen

为什么矩阵加法比 Eigen 中的矩阵向​​量乘法慢？

Why Matrix Addition is slower than Matrix-Vector Multiplication in Eigen?

c++

performance

matrix

eigen

为什么矩阵加法比 Eigen 中的矩阵向量乘法慢？