C++ - Kazushige Goto 论文中的优化矩阵乘法在 O3 标志中的表现比 naive 差
C++ - Optimized matrix multiply from Kazushige Goto's paper performs worse than naive in O3 flag
相关论文是here。我正在尝试重现 Kazushige Goto 关于 快速矩阵乘法的开创性论文,方法是将其衰减为 gepp(通用面板面板)和 gebp(通用块面板)乘法 的子例程,其中显然是 gemm 最快的构建块。我写了下面的代码来测试它并使用 -O3
标志,我看到我的代码的性能实际上 比天真的矩阵乘法差 :
(~0.5x increase)
Time elapsed : 3.82941 <-- naive
Time elapsed : 6.21072 <-- more complex gebp subroutine
然而,没有 -O3
标志,我们看到速度确实比 naive 版本快:
(~4x increase)
Time elapsed : 53.4537 <-- naive
Time elapsed : 15.603 <-- more complex gebp subroutine
根据@ztik 的建议,我在没有 -mavx2 -O3
标志的情况下进行了尝试,并添加了 -O2
,结果与没有任何优化标志的结果相似:
(~4x increase)
Time elapsed : 26.4217 <-- naive
Time elapsed : 6.42583 <-- more complex gebp subroutine
我很清楚 -O3
在 gcc 中启用了数十个优化标志,与复杂的 Goto 论文变体相比,每个优化标志都会单独提高朴素方法的性能。然而,MKL、ATLAS 等是一些著名的 BLAS 变体,它们使用 Goto 的方法(并且比 naive 快几个数量级),尽管使用的是汇编内核而不是 C++ 代码。
这是预期的吗?如果是,我怎样才能提高性能(除了必须编写自展开程序集)?我是不是写了一些糟糕的代码,带有某种没有正确滥用缓存的错误?
环境
我 运行 在 macbook pro 上使用 AVX2 指令,intel 生成:
Intel(R) Core(TM) i7-6660U CPU @ 2.40GHz
我用g++-8
编译了下面的代码,目前是v8.2.0.
可重现的代码
可重现的代码如下:
Makefile
:
CC=g++-8
FLAGS=-std=c++17 -ffast-math -mavx2 -O3
run: main
./main
main: main.cpp
$(CC) -o main main.cpp $(FLAGS)
main.cpp
,其中包含一些用于基准测试的实用程序:
#include <iostream>
#include <cstdio>
#include <ctime>
#include <cmath>
#define TIME_IT(EXPR) \
{ \
std::clock_t __start; \
double __duration; \
__start = std::clock(); \
EXPR; \
__duration = (std::clock() - __start) / (double) CLOCKS_PER_SEC; \
std::cout << "Time elapsed : " << __duration << std::endl; \
}
static constexpr auto X_SIZE = 2048;
static constexpr auto Y_SIZE = 1024;
static constexpr auto Z_SIZE = 2048;
// = 32 floats
static constexpr auto BLK_BYTES = 128;
static constexpr auto BLK_SIZE = BLK_BYTES / 4;
template <size_t row, size_t mid, size_t col>
void initialize_matrices(float (&a)[row][mid], float (&b)[mid][col], float (&c)[row][col]){
// Initialize matrices
for(auto i = 0; i < row; i++){
for(auto j = 0; j < mid; j++){
a[i][j] = ((float) (i*Y_SIZE + j)) / (row * mid);
}
}
for(auto i = 0; i < mid; i++){
for(auto j = 0; j < col; j++){
b[i][j] = ((float) (i*Z_SIZE + j)) / (mid * col);
}
}
for(auto i = 0; i < row; i++){
for(auto j = 0; j < col; j++){
c[i][j] = 0;
}
}
}
template <size_t row, size_t mid, size_t col>
void matmul1(float (&a)[row][mid], float (&b)[mid][col], float (&c)[row][col]){
for(auto i = 0; i < row; i++){
for(auto j = 0; j < col; j++){
float sum = 0;
for(auto k = 0; k < mid; k++){
sum += a[i][k] * b[k][j];
}
c[i][j] = sum;
}
}
}
template <size_t col>
inline void gebp(float (&Ab)[BLK_SIZE][BLK_SIZE], float (&Bp)[BLK_SIZE][col], float (&Cp)[BLK_SIZE][col]){
// We can optimize this subroutine but that'd be overkill for now.
for(auto j = 0; j < col; j++){
for(auto i = 0; i < BLK_SIZE; i++){
float sum = 0;
for(auto k = 0; k < BLK_SIZE; k++){
sum += Ab[i][k] * Bp[k][j];
}
Cp[i][j] += sum;
}
}
}
template <size_t row, size_t col>
inline void packb(float (&a)[row][col], float (&b)[BLK_SIZE][BLK_SIZE], size_t m, size_t n){
// size_t m, n in this case means the m,n-th block to pack.
auto start_row = m * BLK_SIZE;
auto start_col = n * BLK_SIZE;
for(size_t i = 0; i < BLK_SIZE; i++){
for(size_t j = 0; j < BLK_SIZE; j++){
b[i][j] = a[start_row + i][start_col + j];
}
}
}
template <size_t row, size_t mid, size_t col>
inline void gemm(float (&a)[row][mid], float (&b)[mid][col], float (&c)[row][col]){
// Divide up the matrix into panels:
// Suppose row / BLK_SIZE = M
// col / BLK_SIZE = N
// mid / BLK_SIZE = K
//
// For the rest of the function, we assume A, B, C as a, b, c variables,
// and {var}p = panel of var
// {var}b = block of var
//
// (TODO: We assume it's perfectly divisible for now)
// Layout: A = (M,K), B = (K,N), C = (M,N)
auto M = row / BLK_SIZE;
auto N = col / BLK_SIZE;
auto K = mid / BLK_SIZE;
for(auto p = 0; p < K; p++){
// Reassign B[p*BLK_SIZE : (p+1)*BLK_SIZE][:] into Bp
float (&Bp)[BLK_SIZE][col] = *(float (*)[BLK_SIZE][col]) &b[p * BLK_SIZE];
for(auto i = 0; i < M; i++){
// Pack A[i*BLK_SIZE : (i+1)*BLK_SIZE][p*BLK_SIZE : (p+1)*BLK_SIZE] into Ab
float Ab[BLK_SIZE][BLK_SIZE];
// Reassign C[i*BLK_SIZE : (i+1)*BLK_SIZE][:] into Cp
float (&Cp)[BLK_SIZE][col] = *(float (*)[BLK_SIZE][col]) &c[i * BLK_SIZE];
packb(a, Ab, i, p);
// The result of Ab and Bp should be in Cp
gebp(Ab, Bp, Cp);
}
}
}
template <size_t row, size_t col>
bool allclose(float (&a)[row][col], float (&b)[row][col], float threshold = 1e-5, bool verbose = true){
bool is_equal = true;
for(auto i = 0; i < row; i++){
for(auto j = 0; j < col; j++){
bool current_element = std::abs(a[i][j] - b[i][j]) < threshold;
if(verbose && !current_element){
std::cerr << "Element at [" << i << "][" << j << "] is incorrect : "
<< a[i][j] << " vs. " << b[i][j] << "." << std::endl;
}
is_equal = is_equal && current_element;
}
}
return is_equal;
}
float a[X_SIZE][Y_SIZE];
float b[Y_SIZE][Z_SIZE];
float c1[X_SIZE][Z_SIZE];
float c2[X_SIZE][Z_SIZE];
int main(){
initialize_matrices(a, b, c1);
TIME_IT(matmul1(a, b, c1))
// We must guarrantee c is all zeros at first.
initialize_matrices(a, b, c2);
TIME_IT(gemm(a, b, c2))
std::cout << allclose(c1, c2, 1e-1, true) << std::endl;
return 0;
}
有诊断标志
g++-8 -o main main.cpp -std=c++17 -ffast-math -mavx2 -O3 -ftree-vectorize -fopt-info-vec-missed
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:1083:16: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/char_traits.h:320:25: note: not vectorized: not enough data-refs in basic block.
main.cpp:41:23: note: not vectorized: complicated access pattern.
main.cpp:41:23: note: bad data access.
main.cpp:42:27: note: Unknown misalignment, naturally aligned
main.cpp:36:23: note: not vectorized: complicated access pattern.
main.cpp:36:23: note: bad data access.
main.cpp:37:27: note: Unknown misalignment, naturally aligned
main.cpp:38:34: note: not vectorized: not enough data-refs in basic block.
main.cpp:38:13: note: not vectorized: no vectype for stmt: MEM[(float *)vectp_a.16_42] = vect__6.15_7;
scalar_type: vector(8) float
main.cpp:38:13: note: not vectorized: not enough data-refs in basic block.
main.cpp:36:23: note: not vectorized: not enough data-refs in basic block.
main.cpp:43:34: note: not vectorized: not enough data-refs in basic block.
main.cpp:43:13: note: not vectorized: no vectype for stmt: MEM[(float *)vectp_b.10_15] = vect__12.9_38;
scalar_type: vector(8) float
main.cpp:43:13: note: not vectorized: not enough data-refs in basic block.
main.cpp:41:23: note: not vectorized: not enough data-refs in basic block.
main.cpp:51:1: note: not vectorized: not enough data-refs in basic block.
main.cpp:121:23: note: not vectorized: multiple nested loops.
main.cpp:121:23: note: bad loop form.
main.cpp:125:27: note: not vectorized: multiple nested loops.
main.cpp:125:27: note: bad loop form.
main.cpp:69:23: note: not vectorized: multiple nested loops.
main.cpp:69:23: note: bad loop form.
main.cpp:70:27: note: step unknown.
main.cpp:70:27: note: not consecutive access *Cp_14[i_63][j_42] = _30;
main.cpp:70:27: note: not vectorized: complicated access pattern.
main.cpp:70:27: note: bad data access.
main.cpp:72:31: note: step unknown.
main.cpp:72:31: note: misalign = 0 bytes of ref Ab[i_63][k_64]
main.cpp:72:31: note: Unknown alignment for access: *Bp_12[k_64][j_42]
main.cpp:72:31: note: vector alignment may not be reachable
main.cpp:72:31: note: num. args = 4 (not unary/binary/ternary op).
main.cpp:72:31: note: not ssa-name.
main.cpp:72:31: note: use not simple.
main.cpp:72:31: note: num. args = 4 (not unary/binary/ternary op).
main.cpp:72:31: note: not ssa-name.
main.cpp:72:31: note: use not simple.
main.cpp:72:31: note: no array mode for V8SF[2048]
main.cpp:72:31: note: single-element interleaving not supported for not adjacent vector loads
main.cpp:72:31: note: not falling back to elementwise accesses
main.cpp:72:31: note: not vectorized: relevant stmt not supported: _24 = *Bp_12[k_64][j_42];
main.cpp:72:31: note: bad operation or unsupported loop bound.
main.cpp:72:31: note: step unknown.
main.cpp:72:31: note: misalign = 0 bytes of ref Ab[i_63][k_64]
main.cpp:72:31: note: Unknown alignment for access: *Bp_12[k_64][j_42]
main.cpp:72:31: note: vector alignment may not be reachable
main.cpp:72:31: note: num. args = 4 (not unary/binary/ternary op).
main.cpp:72:31: note: not ssa-name.
main.cpp:72:31: note: use not simple.
main.cpp:72:31: note: num. args = 4 (not unary/binary/ternary op).
main.cpp:72:31: note: not ssa-name.
main.cpp:72:31: note: use not simple.
main.cpp:72:31: note: no array mode for V4SF[2048]
main.cpp:72:31: note: single-element interleaving not supported for not adjacent vector loads
main.cpp:72:31: note: not falling back to elementwise accesses
main.cpp:72:31: note: not vectorized: relevant stmt not supported: _24 = *Bp_12[k_64][j_42];
main.cpp:72:31: note: bad operation or unsupported loop bound.
main.cpp:97:25: note: not vectorized: loop contains function calls or data references that cannot be analyzed
main.cpp:96:10: note: not vectorized: not enough data-refs in basic block.
main.cpp:95:10: note: not vectorized: not enough data-refs in basic block.
main.cpp:95:10: note: not vectorized: not enough data-refs in basic block.
main.cpp:97:25: note: not vectorized: not enough data-refs in basic block.
main.cpp:72:31: note: not consecutive access _24 = *Bp_12[k_64][j_42];
main.cpp:72:31: note: not consecutive access _23 = Ab[i_63][k_64];
main.cpp:72:31: note: not vectorized: no grouped stores in basic block.
main.cpp:70:27: note: not consecutive access _29 = *Cp_14[i_63][j_42];
main.cpp:70:27: note: not consecutive access *Cp_14[i_63][j_42] = _30;
main.cpp:70:27: note: not vectorized: no grouped stores in basic block.
main.cpp:69:23: note: not vectorized: not enough data-refs in basic block.
main.cpp:125:27: note: not vectorized: not enough data-refs in basic block.
main.cpp:121:23: note: not vectorized: not enough data-refs in basic block.
main.cpp:137:1: note: not vectorized: not enough data-refs in basic block.
main.cpp:140:6: note: not vectorized: control flow in loop.
main.cpp:140:6: note: bad loop form.
main.cpp:144:49: note: not vectorized: control flow in loop.
main.cpp:144:49: note: bad loop form.
main.cpp:145:13: note: not consecutive access _3 = *a_20(D)[i_17][j_131];
main.cpp:145:13: note: not consecutive access _4 = *b_21(D)[i_17][j_131];
main.cpp:145:13: note: not vectorized: no grouped stores in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:228:43: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:228:43: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/basic_ios.h:49:7: note: not consecutive access _118 = MEM[(const struct basic_ios *)_58]._M_ctype;
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/basic_ios.h:49:7: note: not consecutive access _55 = MEM[(struct basic_ostream *)_36]._vptr.basic_ostream;
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/basic_ios.h:49:7: note: not consecutive access _56 = MEM[(long int *)_55 + -24B];
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/basic_ios.h:49:7: note: not vectorized: no grouped stores in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:874:2: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:875:51: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:877:27: note: not consecutive access _127 = MEM[(const struct ctype *)_118].D.35451._vptr.facet;
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:877:27: note: not consecutive access _128 = MEM[(int (*) () *)_127 + 48B];
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:877:27: note: not vectorized: no grouped stores in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:877:23: note: not vectorized: not enough data-refs in basic block.
main.cpp:143:27: note: not vectorized: not enough data-refs in basic block.
main.cpp:142:23: note: not vectorized: not enough data-refs in basic block.
main.cpp:152:12: note: not vectorized: not enough data-refs in basic block.
main.cpp:55:23: note: not vectorized: multiple nested loops.
main.cpp:55:23: note: bad loop form.
main.cpp:56:27: note: step unknown.
main.cpp:56:27: note: inner step doesn't divide the vector alignment.
main.cpp:56:27: note: Unknown alignment for access: a[i_57][k_59]
main.cpp:56:27: note: misalign = 0 bytes of ref b[k_59][j_58]
main.cpp:56:27: note: misalign = 0 bytes of ref c1[i_57][j_58]
main.cpp:56:27: note: Unknown misalignment, naturally aligned
main.cpp:56:27: note: num. args = 4 (not unary/binary/ternary op).
main.cpp:56:27: note: not ssa-name.
main.cpp:56:27: note: use not simple.
main.cpp:56:27: note: num. args = 4 (not unary/binary/ternary op).
main.cpp:56:27: note: not ssa-name.
main.cpp:56:27: note: use not simple.
main.cpp:56:27: note: can't use a fully-masked loop because no conditional operation is available.
main.cpp:162:5: note: not vectorized: not enough data-refs in basic block.
main.cpp:162:5: note: not vectorized: not enough data-refs in basic block.
main.cpp:58:31: note: not vectorized: no vectype for stmt: vect__37.134_72 = MEM[(float *)vectp_b.132_70];
scalar_type: vector(8) float
main.cpp:58:31: note: not consecutive access _36 = a[i_57][k_59];
main.cpp:58:31: note: not vectorized: no grouped stores in basic block.
main.cpp:61:13: note: not vectorized: no vectype for stmt: MEM[(float *)vectp_c1.138_80] = vect_sum_40.136_76;
scalar_type: vector(8) float
main.cpp:61:13: note: not vectorized: not enough data-refs in basic block.
main.cpp:55:23: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:562:44: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:561:18: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:562:44: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:561:18: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:175:29: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:113:13: note: not vectorized: not enough data-refs in basic block.
main.cpp:170:1: note: not vectorized: not enough data-refs in basic block.
./main
Time elapsed : 4.03692
Time elapsed : 6.33257
手波式答案:玩弄区块大小
我尝试使用 gebp 的块大小,似乎对于 2048x1024 x 1024 x 2048 矩阵,单个浮点数似乎是 -O2
下的最佳性能并且没有向量化指令。这有点奇怪,因为我认为更大的块可以驻留在 tlb/L1/L2 缓存中。
1 float:
Time elapsed : 2.33989
8 floats:
Time elapsed : 4.34408
16 floats:
Time elapsed : 4.522
32 floats:
Time elapsed : 6.42583
64 floats:
Time elapsed : 7.35193
128 floats:
Time elapsed : 8.18502
256 floats:
Time elapsed : 8.1686
但是,这里有一些非常有趣的东西,带有矢量化指令:
1 float:
Time elapsed : 1.3284
2 floats:
Time elapsed : 1.62332
4 floats:
Time elapsed : 0.421444
8 floats:
Time elapsed : 0.527425
16 floats:
Time elapsed : 4.03877
...从这里开始情况变得更糟。 gcc 自动向量化器似乎针对 AVX 和 AVX2 指令的 4 和 8 特殊情况进行了积极优化。 这只是一个推测,而不是一个严格的发现。 在极端情况下,向量化指令的加速实际上从 4 倍提高到 8 倍!
在向量化环境下的 2048x2048 方阵乘法:
Time elapsed : 11.9983 <-- naive
Time elapsed : 0.822692 <-- vectorized
Time elapsed : 0.472855 <-- ATLAS blas GEMM
速度提升超过 10 倍。与 ATLAS 相比,在我的基准测试中(使用 xtensor
作为接口),我的实现只慢了 <2 倍。
只是一个假设,在评论中看起来不太好:
auto start_row = m * BLK_SIZE;
auto start_col = n * BLK_SIZE;
for(size_t i = 0; i < BLK_SIZE; i++){
for(size_t j = 0; j < BLK_SIZE; j++){
b[i][j] = a[start_row + i][start_col + j];
}
}
可以稍微优化成
const auto start_row = m * BLK_SIZE;
const auto start_col = n * BLK_SIZE;
const auto end_row = start_row + BLK_SIZE;
const auto end_col = start_col + BLK_SIZE;
for(size_t i = start_row; i < end_row; i++){
for(size_t j = start_col; j < end_col; j++){
b[i][j] = a[i][j];
}
}
它可能已经由编译器完成,但谁知道不检查汇编输出呢?
这里的memcpy不好吗?它看起来像一个副本。
相关论文是here。我正在尝试重现 Kazushige Goto 关于 快速矩阵乘法的开创性论文,方法是将其衰减为 gepp(通用面板面板)和 gebp(通用块面板)乘法 的子例程,其中显然是 gemm 最快的构建块。我写了下面的代码来测试它并使用 -O3
标志,我看到我的代码的性能实际上 比天真的矩阵乘法差 :
(~0.5x increase)
Time elapsed : 3.82941 <-- naive
Time elapsed : 6.21072 <-- more complex gebp subroutine
然而,没有 -O3
标志,我们看到速度确实比 naive 版本快:
(~4x increase)
Time elapsed : 53.4537 <-- naive
Time elapsed : 15.603 <-- more complex gebp subroutine
根据@ztik 的建议,我在没有 -mavx2 -O3
标志的情况下进行了尝试,并添加了 -O2
,结果与没有任何优化标志的结果相似:
(~4x increase)
Time elapsed : 26.4217 <-- naive
Time elapsed : 6.42583 <-- more complex gebp subroutine
我很清楚 -O3
在 gcc 中启用了数十个优化标志,与复杂的 Goto 论文变体相比,每个优化标志都会单独提高朴素方法的性能。然而,MKL、ATLAS 等是一些著名的 BLAS 变体,它们使用 Goto 的方法(并且比 naive 快几个数量级),尽管使用的是汇编内核而不是 C++ 代码。
这是预期的吗?如果是,我怎样才能提高性能(除了必须编写自展开程序集)?我是不是写了一些糟糕的代码,带有某种没有正确滥用缓存的错误?
环境
我 运行 在 macbook pro 上使用 AVX2 指令,intel 生成:
Intel(R) Core(TM) i7-6660U CPU @ 2.40GHz
我用g++-8
编译了下面的代码,目前是v8.2.0.
可重现的代码
可重现的代码如下:
Makefile
:
CC=g++-8
FLAGS=-std=c++17 -ffast-math -mavx2 -O3
run: main
./main
main: main.cpp
$(CC) -o main main.cpp $(FLAGS)
main.cpp
,其中包含一些用于基准测试的实用程序:
#include <iostream>
#include <cstdio>
#include <ctime>
#include <cmath>
#define TIME_IT(EXPR) \
{ \
std::clock_t __start; \
double __duration; \
__start = std::clock(); \
EXPR; \
__duration = (std::clock() - __start) / (double) CLOCKS_PER_SEC; \
std::cout << "Time elapsed : " << __duration << std::endl; \
}
static constexpr auto X_SIZE = 2048;
static constexpr auto Y_SIZE = 1024;
static constexpr auto Z_SIZE = 2048;
// = 32 floats
static constexpr auto BLK_BYTES = 128;
static constexpr auto BLK_SIZE = BLK_BYTES / 4;
template <size_t row, size_t mid, size_t col>
void initialize_matrices(float (&a)[row][mid], float (&b)[mid][col], float (&c)[row][col]){
// Initialize matrices
for(auto i = 0; i < row; i++){
for(auto j = 0; j < mid; j++){
a[i][j] = ((float) (i*Y_SIZE + j)) / (row * mid);
}
}
for(auto i = 0; i < mid; i++){
for(auto j = 0; j < col; j++){
b[i][j] = ((float) (i*Z_SIZE + j)) / (mid * col);
}
}
for(auto i = 0; i < row; i++){
for(auto j = 0; j < col; j++){
c[i][j] = 0;
}
}
}
template <size_t row, size_t mid, size_t col>
void matmul1(float (&a)[row][mid], float (&b)[mid][col], float (&c)[row][col]){
for(auto i = 0; i < row; i++){
for(auto j = 0; j < col; j++){
float sum = 0;
for(auto k = 0; k < mid; k++){
sum += a[i][k] * b[k][j];
}
c[i][j] = sum;
}
}
}
template <size_t col>
inline void gebp(float (&Ab)[BLK_SIZE][BLK_SIZE], float (&Bp)[BLK_SIZE][col], float (&Cp)[BLK_SIZE][col]){
// We can optimize this subroutine but that'd be overkill for now.
for(auto j = 0; j < col; j++){
for(auto i = 0; i < BLK_SIZE; i++){
float sum = 0;
for(auto k = 0; k < BLK_SIZE; k++){
sum += Ab[i][k] * Bp[k][j];
}
Cp[i][j] += sum;
}
}
}
template <size_t row, size_t col>
inline void packb(float (&a)[row][col], float (&b)[BLK_SIZE][BLK_SIZE], size_t m, size_t n){
// size_t m, n in this case means the m,n-th block to pack.
auto start_row = m * BLK_SIZE;
auto start_col = n * BLK_SIZE;
for(size_t i = 0; i < BLK_SIZE; i++){
for(size_t j = 0; j < BLK_SIZE; j++){
b[i][j] = a[start_row + i][start_col + j];
}
}
}
template <size_t row, size_t mid, size_t col>
inline void gemm(float (&a)[row][mid], float (&b)[mid][col], float (&c)[row][col]){
// Divide up the matrix into panels:
// Suppose row / BLK_SIZE = M
// col / BLK_SIZE = N
// mid / BLK_SIZE = K
//
// For the rest of the function, we assume A, B, C as a, b, c variables,
// and {var}p = panel of var
// {var}b = block of var
//
// (TODO: We assume it's perfectly divisible for now)
// Layout: A = (M,K), B = (K,N), C = (M,N)
auto M = row / BLK_SIZE;
auto N = col / BLK_SIZE;
auto K = mid / BLK_SIZE;
for(auto p = 0; p < K; p++){
// Reassign B[p*BLK_SIZE : (p+1)*BLK_SIZE][:] into Bp
float (&Bp)[BLK_SIZE][col] = *(float (*)[BLK_SIZE][col]) &b[p * BLK_SIZE];
for(auto i = 0; i < M; i++){
// Pack A[i*BLK_SIZE : (i+1)*BLK_SIZE][p*BLK_SIZE : (p+1)*BLK_SIZE] into Ab
float Ab[BLK_SIZE][BLK_SIZE];
// Reassign C[i*BLK_SIZE : (i+1)*BLK_SIZE][:] into Cp
float (&Cp)[BLK_SIZE][col] = *(float (*)[BLK_SIZE][col]) &c[i * BLK_SIZE];
packb(a, Ab, i, p);
// The result of Ab and Bp should be in Cp
gebp(Ab, Bp, Cp);
}
}
}
template <size_t row, size_t col>
bool allclose(float (&a)[row][col], float (&b)[row][col], float threshold = 1e-5, bool verbose = true){
bool is_equal = true;
for(auto i = 0; i < row; i++){
for(auto j = 0; j < col; j++){
bool current_element = std::abs(a[i][j] - b[i][j]) < threshold;
if(verbose && !current_element){
std::cerr << "Element at [" << i << "][" << j << "] is incorrect : "
<< a[i][j] << " vs. " << b[i][j] << "." << std::endl;
}
is_equal = is_equal && current_element;
}
}
return is_equal;
}
float a[X_SIZE][Y_SIZE];
float b[Y_SIZE][Z_SIZE];
float c1[X_SIZE][Z_SIZE];
float c2[X_SIZE][Z_SIZE];
int main(){
initialize_matrices(a, b, c1);
TIME_IT(matmul1(a, b, c1))
// We must guarrantee c is all zeros at first.
initialize_matrices(a, b, c2);
TIME_IT(gemm(a, b, c2))
std::cout << allclose(c1, c2, 1e-1, true) << std::endl;
return 0;
}
有诊断标志
g++-8 -o main main.cpp -std=c++17 -ffast-math -mavx2 -O3 -ftree-vectorize -fopt-info-vec-missed
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:1083:16: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/char_traits.h:320:25: note: not vectorized: not enough data-refs in basic block.
main.cpp:41:23: note: not vectorized: complicated access pattern.
main.cpp:41:23: note: bad data access.
main.cpp:42:27: note: Unknown misalignment, naturally aligned
main.cpp:36:23: note: not vectorized: complicated access pattern.
main.cpp:36:23: note: bad data access.
main.cpp:37:27: note: Unknown misalignment, naturally aligned
main.cpp:38:34: note: not vectorized: not enough data-refs in basic block.
main.cpp:38:13: note: not vectorized: no vectype for stmt: MEM[(float *)vectp_a.16_42] = vect__6.15_7;
scalar_type: vector(8) float
main.cpp:38:13: note: not vectorized: not enough data-refs in basic block.
main.cpp:36:23: note: not vectorized: not enough data-refs in basic block.
main.cpp:43:34: note: not vectorized: not enough data-refs in basic block.
main.cpp:43:13: note: not vectorized: no vectype for stmt: MEM[(float *)vectp_b.10_15] = vect__12.9_38;
scalar_type: vector(8) float
main.cpp:43:13: note: not vectorized: not enough data-refs in basic block.
main.cpp:41:23: note: not vectorized: not enough data-refs in basic block.
main.cpp:51:1: note: not vectorized: not enough data-refs in basic block.
main.cpp:121:23: note: not vectorized: multiple nested loops.
main.cpp:121:23: note: bad loop form.
main.cpp:125:27: note: not vectorized: multiple nested loops.
main.cpp:125:27: note: bad loop form.
main.cpp:69:23: note: not vectorized: multiple nested loops.
main.cpp:69:23: note: bad loop form.
main.cpp:70:27: note: step unknown.
main.cpp:70:27: note: not consecutive access *Cp_14[i_63][j_42] = _30;
main.cpp:70:27: note: not vectorized: complicated access pattern.
main.cpp:70:27: note: bad data access.
main.cpp:72:31: note: step unknown.
main.cpp:72:31: note: misalign = 0 bytes of ref Ab[i_63][k_64]
main.cpp:72:31: note: Unknown alignment for access: *Bp_12[k_64][j_42]
main.cpp:72:31: note: vector alignment may not be reachable
main.cpp:72:31: note: num. args = 4 (not unary/binary/ternary op).
main.cpp:72:31: note: not ssa-name.
main.cpp:72:31: note: use not simple.
main.cpp:72:31: note: num. args = 4 (not unary/binary/ternary op).
main.cpp:72:31: note: not ssa-name.
main.cpp:72:31: note: use not simple.
main.cpp:72:31: note: no array mode for V8SF[2048]
main.cpp:72:31: note: single-element interleaving not supported for not adjacent vector loads
main.cpp:72:31: note: not falling back to elementwise accesses
main.cpp:72:31: note: not vectorized: relevant stmt not supported: _24 = *Bp_12[k_64][j_42];
main.cpp:72:31: note: bad operation or unsupported loop bound.
main.cpp:72:31: note: step unknown.
main.cpp:72:31: note: misalign = 0 bytes of ref Ab[i_63][k_64]
main.cpp:72:31: note: Unknown alignment for access: *Bp_12[k_64][j_42]
main.cpp:72:31: note: vector alignment may not be reachable
main.cpp:72:31: note: num. args = 4 (not unary/binary/ternary op).
main.cpp:72:31: note: not ssa-name.
main.cpp:72:31: note: use not simple.
main.cpp:72:31: note: num. args = 4 (not unary/binary/ternary op).
main.cpp:72:31: note: not ssa-name.
main.cpp:72:31: note: use not simple.
main.cpp:72:31: note: no array mode for V4SF[2048]
main.cpp:72:31: note: single-element interleaving not supported for not adjacent vector loads
main.cpp:72:31: note: not falling back to elementwise accesses
main.cpp:72:31: note: not vectorized: relevant stmt not supported: _24 = *Bp_12[k_64][j_42];
main.cpp:72:31: note: bad operation or unsupported loop bound.
main.cpp:97:25: note: not vectorized: loop contains function calls or data references that cannot be analyzed
main.cpp:96:10: note: not vectorized: not enough data-refs in basic block.
main.cpp:95:10: note: not vectorized: not enough data-refs in basic block.
main.cpp:95:10: note: not vectorized: not enough data-refs in basic block.
main.cpp:97:25: note: not vectorized: not enough data-refs in basic block.
main.cpp:72:31: note: not consecutive access _24 = *Bp_12[k_64][j_42];
main.cpp:72:31: note: not consecutive access _23 = Ab[i_63][k_64];
main.cpp:72:31: note: not vectorized: no grouped stores in basic block.
main.cpp:70:27: note: not consecutive access _29 = *Cp_14[i_63][j_42];
main.cpp:70:27: note: not consecutive access *Cp_14[i_63][j_42] = _30;
main.cpp:70:27: note: not vectorized: no grouped stores in basic block.
main.cpp:69:23: note: not vectorized: not enough data-refs in basic block.
main.cpp:125:27: note: not vectorized: not enough data-refs in basic block.
main.cpp:121:23: note: not vectorized: not enough data-refs in basic block.
main.cpp:137:1: note: not vectorized: not enough data-refs in basic block.
main.cpp:140:6: note: not vectorized: control flow in loop.
main.cpp:140:6: note: bad loop form.
main.cpp:144:49: note: not vectorized: control flow in loop.
main.cpp:144:49: note: bad loop form.
main.cpp:145:13: note: not consecutive access _3 = *a_20(D)[i_17][j_131];
main.cpp:145:13: note: not consecutive access _4 = *b_21(D)[i_17][j_131];
main.cpp:145:13: note: not vectorized: no grouped stores in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:228:43: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:228:43: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/basic_ios.h:49:7: note: not consecutive access _118 = MEM[(const struct basic_ios *)_58]._M_ctype;
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/basic_ios.h:49:7: note: not consecutive access _55 = MEM[(struct basic_ostream *)_36]._vptr.basic_ostream;
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/basic_ios.h:49:7: note: not consecutive access _56 = MEM[(long int *)_55 + -24B];
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/basic_ios.h:49:7: note: not vectorized: no grouped stores in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:874:2: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:875:51: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:877:27: note: not consecutive access _127 = MEM[(const struct ctype *)_118].D.35451._vptr.facet;
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:877:27: note: not consecutive access _128 = MEM[(int (*) () *)_127 + 48B];
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:877:27: note: not vectorized: no grouped stores in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/bits/locale_facets.h:877:23: note: not vectorized: not enough data-refs in basic block.
main.cpp:143:27: note: not vectorized: not enough data-refs in basic block.
main.cpp:142:23: note: not vectorized: not enough data-refs in basic block.
main.cpp:152:12: note: not vectorized: not enough data-refs in basic block.
main.cpp:55:23: note: not vectorized: multiple nested loops.
main.cpp:55:23: note: bad loop form.
main.cpp:56:27: note: step unknown.
main.cpp:56:27: note: inner step doesn't divide the vector alignment.
main.cpp:56:27: note: Unknown alignment for access: a[i_57][k_59]
main.cpp:56:27: note: misalign = 0 bytes of ref b[k_59][j_58]
main.cpp:56:27: note: misalign = 0 bytes of ref c1[i_57][j_58]
main.cpp:56:27: note: Unknown misalignment, naturally aligned
main.cpp:56:27: note: num. args = 4 (not unary/binary/ternary op).
main.cpp:56:27: note: not ssa-name.
main.cpp:56:27: note: use not simple.
main.cpp:56:27: note: num. args = 4 (not unary/binary/ternary op).
main.cpp:56:27: note: not ssa-name.
main.cpp:56:27: note: use not simple.
main.cpp:56:27: note: can't use a fully-masked loop because no conditional operation is available.
main.cpp:162:5: note: not vectorized: not enough data-refs in basic block.
main.cpp:162:5: note: not vectorized: not enough data-refs in basic block.
main.cpp:58:31: note: not vectorized: no vectype for stmt: vect__37.134_72 = MEM[(float *)vectp_b.132_70];
scalar_type: vector(8) float
main.cpp:58:31: note: not consecutive access _36 = a[i_57][k_59];
main.cpp:58:31: note: not vectorized: no grouped stores in basic block.
main.cpp:61:13: note: not vectorized: no vectype for stmt: MEM[(float *)vectp_c1.138_80] = vect_sum_40.136_76;
scalar_type: vector(8) float
main.cpp:61:13: note: not vectorized: not enough data-refs in basic block.
main.cpp:55:23: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:562:44: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:561:18: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:562:44: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:561:18: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:175:29: note: not vectorized: not enough data-refs in basic block.
/usr/local/Cellar/gcc/8.2.0/include/c++/8.2.0/ostream:113:13: note: not vectorized: not enough data-refs in basic block.
main.cpp:170:1: note: not vectorized: not enough data-refs in basic block.
./main
Time elapsed : 4.03692
Time elapsed : 6.33257
手波式答案:玩弄区块大小
我尝试使用 gebp 的块大小,似乎对于 2048x1024 x 1024 x 2048 矩阵,单个浮点数似乎是 -O2
下的最佳性能并且没有向量化指令。这有点奇怪,因为我认为更大的块可以驻留在 tlb/L1/L2 缓存中。
1 float:
Time elapsed : 2.33989
8 floats:
Time elapsed : 4.34408
16 floats:
Time elapsed : 4.522
32 floats:
Time elapsed : 6.42583
64 floats:
Time elapsed : 7.35193
128 floats:
Time elapsed : 8.18502
256 floats:
Time elapsed : 8.1686
但是,这里有一些非常有趣的东西,带有矢量化指令:
1 float:
Time elapsed : 1.3284
2 floats:
Time elapsed : 1.62332
4 floats:
Time elapsed : 0.421444
8 floats:
Time elapsed : 0.527425
16 floats:
Time elapsed : 4.03877
...从这里开始情况变得更糟。 gcc 自动向量化器似乎针对 AVX 和 AVX2 指令的 4 和 8 特殊情况进行了积极优化。 这只是一个推测,而不是一个严格的发现。 在极端情况下,向量化指令的加速实际上从 4 倍提高到 8 倍!
在向量化环境下的 2048x2048 方阵乘法:
Time elapsed : 11.9983 <-- naive
Time elapsed : 0.822692 <-- vectorized
Time elapsed : 0.472855 <-- ATLAS blas GEMM
速度提升超过 10 倍。与 ATLAS 相比,在我的基准测试中(使用 xtensor
作为接口),我的实现只慢了 <2 倍。
只是一个假设,在评论中看起来不太好:
auto start_row = m * BLK_SIZE;
auto start_col = n * BLK_SIZE;
for(size_t i = 0; i < BLK_SIZE; i++){
for(size_t j = 0; j < BLK_SIZE; j++){
b[i][j] = a[start_row + i][start_col + j];
}
}
可以稍微优化成
const auto start_row = m * BLK_SIZE;
const auto start_col = n * BLK_SIZE;
const auto end_row = start_row + BLK_SIZE;
const auto end_col = start_col + BLK_SIZE;
for(size_t i = start_row; i < end_row; i++){
for(size_t j = start_col; j < end_col; j++){
b[i][j] = a[i][j];
}
}
它可能已经由编译器完成,但谁知道不检查汇编输出呢?
这里的memcpy不好吗?它看起来像一个副本。