Why is it faster to perform float by float matrix multiplication compared to int by int?

Question

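I have two int matrices A and B, with more than 1000 rows and 10K columns, and I often need to convert them to float matrices to gain a speedup (4x or more).

I'm wondering why this is the case. I realize that there is a lot of optimization and vectorization, such as AVX, going on with float matrix multiplication. But there are also integer instructions, such as those in AVX2 (if I'm not mistaken). And can't one make use of SSE and AVX for integers?

Why isn't there a heuristic underneath matrix algebra libraries such as Numpy or Eigen to capture this and perform integer matrix multiplication just as fast as float?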
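For concreteness, the conversion workaround can be sketched in plain C++ as below. This is only an illustration under stated assumptions: the matrices are stored row-major in a `std::vector`, and the helper names `matmul` and `matmul_via_float` are hypothetical, not taken from Numpy or Eigen. The naive triple loop is a stand-in for the library's optimized float kernel, which is where the speedup would actually come from. One caveat is worth keeping in mind: a 32-bit float has a 24-bit significand, so the round trip returns exact integers only as long as every product and partial sum stays within ±2^24.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: naive row-major product C = A * B,
// where A is n x k, B is k x m and T is the element type.
template <typename T>
std::vector<T> matmul(const std::vector<T>& A, const std::vector<T>& B,
                      std::size_t n, std::size_t k, std::size_t m)
{
    assert(A.size() == n * k && B.size() == k * m);
    std::vector<T> C(n * m, T(0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            const T a = A[i * k + p];
            for (std::size_t j = 0; j < m; ++j) {
                C[i * m + j] += a * B[p * m + j];
            }
        }
    }
    return C;
}

// Hypothetical helper: convert int -> float, multiply in float,
// then round the result back to int.
// Exact only while all intermediate values fit in 24 bits.
std::vector<int> matmul_via_float(const std::vector<int>& A,
                                  const std::vector<int>& B,
                                  std::size_t n, std::size_t k, std::size_t m)
{
    const std::vector<float> Af(A.begin(), A.end());
    const std::vector<float> Bf(B.begin(), B.end());
    const std::vector<float> Cf = matmul(Af, Bf, n, k, m);

    std::vector<int> C(Cf.size());
    for (std::size_t i = 0; i < Cf.size(); ++i) {
        C[i] = static_cast<int>(std::lround(Cf[i]));
    }
    return C;
}
```

Calling `matmul_via_float(A, B, n, k, m)` on small inputs should give the same result as `matmul(A, B, n, k, m)` on the int data. For large entries or very long inner dimensions, converting to double instead of float widens the exact range to 53 bits.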

About accepted answer: While @sascha's answer is very informative and relevant, @chatz's answer is the actual reason why the int by int multiplication is slow irrespective of whether BLAS integer matrix operations exist.

Answer 1

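All those vector-vector and matrix-vector operations use BLAS internally. BLAS, optimized over decades for different architectures, CPUs, instructions and cache sizes, has no integer type!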

There is a branch of OpenBLAS working on it (and a brief discussion on Google Groups that links to it).

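I also think I heard that Intel's MKL (Intel's BLAS implementation) might be working on integer types too. This talk looks interesting (it is mentioned in that forum), although it is short and probably leans more toward the small integer types that are useful in embedded deep learning.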

Answer 2

If you compile these two simple functions, which essentially just compute a product (using the Eigen library):

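#include <Eigen/Core>

// Integer product; returns one entry so the result is actually used.
int mult_int(const Eigen::MatrixXi& A, const Eigen::MatrixXi& B)
{
    Eigen::MatrixXi C = A * B;
    return C(0, 0);
}

// Same computation with single-precision floats.
int mult_float(const Eigen::MatrixXf& A, const Eigen::MatrixXf& B)
{
    Eigen::MatrixXf C = A * B;
    return C(0, 0);
}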

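using the flags -mavx2 -S -O3, you will see very similar assembly code for the integer and the float versions. The main difference, however, is that vpmulld has 2-3 times the latency and only 1/2 or 1/4 of the throughput of vmulps (on recent Intel architectures).

Reference: Intel Intrinsics Guide. "Throughput" there means the reciprocal throughput, i.e., how many clock cycles are used per operation if no latency occurs (somewhat simplified).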
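To get a rough feel for this difference without Eigen, one could time the same multiply-accumulate loop for int and float. The sketch below is an assumption-laden illustration, not a rigorous benchmark: `mul_acc` and `time_mul_acc` are hypothetical names, and whether the compiler actually emits vpmulld and vmulps for the two instantiations depends on the compiler and flags (check with -S, as above). The measured ratio will vary with the CPU, and for arrays larger than the cache the loop is likely to be memory-bound, which hides the instruction cost.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical kernel: element-wise multiply-accumulate out[i] += a[i] * b[i].
// With -O3 -mavx2 a compiler may vectorize this using vpmulld for T = int
// and vmulps for T = float (an assumption worth verifying in the assembly).
template <typename T>
void mul_acc(const std::vector<T>& a, const std::vector<T>& b,
             std::vector<T>& out)
{
    const std::size_t n = out.size();
    for (std::size_t i = 0; i < n; ++i) {
        out[i] += a[i] * b[i];
    }
}

// Runs `reps` passes over arrays of length n (n must be > 0) and returns
// the elapsed wall-clock time in seconds. One output element is written to
// `checksum` so the work stays observable to the optimizer.
template <typename T>
double time_mul_acc(std::size_t n, int reps, T& checksum)
{
    std::vector<T> a(n, T(3)), b(n, T(2)), out(n, T(0));

    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        mul_acc(a, b, out);
    }
    const auto t1 = std::chrono::steady_clock::now();

    checksum = out[n / 2];  // each pass adds 3 * 2 = 6 to every element
    return std::chrono::duration<double>(t1 - t0).count();
}
```

A possible use is to call `time_mul_acc<int>(n, reps, ci)` and `time_mul_acc<float>(n, reps, cf)` with the same `n` and `reps`, keep `n` small enough to fit in L1 cache, and compare the two returned times over several runs.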

c++

numpy

matrix

avx

eigen