Explanation for why allocating a second time changes performance
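I was testing some micro-benchmarks on dense matrix multiplication (out of curiosity), and I noticed some very strange performance results.
Here is a minimal working example: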
#include <benchmark/benchmark.h>
#include <random>

constexpr long long n = 128;

struct mat_bench_fixture : public benchmark::Fixture
{
    double *matA, *matB, *matC;

    mat_bench_fixture()
    {
        matA = new double[n * n];
        matB = new double[n * n];
        matC = new double[n * n];
        benchmark::DoNotOptimize(matA);
        benchmark::DoNotOptimize(matB);
        benchmark::DoNotOptimize(matC);
#if 0
        delete[] matA;
        delete[] matB;
        delete[] matC;
        benchmark::DoNotOptimize(matA);
        benchmark::DoNotOptimize(matB);
        benchmark::DoNotOptimize(matC);
        matA = new double[n * n];
        matB = new double[n * n];
        matC = new double[n * n];
        benchmark::DoNotOptimize(matA);
        benchmark::DoNotOptimize(matB);
        benchmark::DoNotOptimize(matC);
#endif
    }

    ~mat_bench_fixture()
    {
        delete[] matA;
        delete[] matB;
        delete[] matC;
    }

    void SetUp(const benchmark::State& s) override
    {
        // generate random data
        std::mt19937 gen;
        std::uniform_real_distribution<double> dis(0, 1);
        for (double* i = matA; i != matA + n * n; ++i)
        {
            *i = dis(gen);
        }
        for (double* i = matB; i != matB + n * n; ++i)
        {
            *i = dis(gen);
        }
    }
};

BENCHMARK_DEFINE_F(mat_bench_fixture, impl1)(benchmark::State& st)
{
    for (auto _ : st)
    {
        for (long long row = 0; row < n; ++row)
        {
            for (long long col = 0; col < n; ++col)
            {
                matC[row * n + col] = 0;
                for (long long k = 0; k < n; ++k)
                {
                    matC[row * n + col] += matA[row * n + k] * matB[k * n + col];
                }
            }
        }
        benchmark::DoNotOptimize(matA);
        benchmark::DoNotOptimize(matB);
        benchmark::DoNotOptimize(matC);
        benchmark::ClobberMemory();
    }
}

BENCHMARK_REGISTER_F(mat_bench_fixture, impl1);
BENCHMARK_MAIN();
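There's an #if 0 block in the constructor of the fixture which can be switched to #if 1 for the two different scenarios I'm testing. What I've noticed is that when I force a re-allocation of all the buffers, the time the benchmark takes to run on my system drops by about 15%, and I have no explanation for why this happens. I was hoping someone could enlighten me. I'd also like to know whether there are any additional micro-benchmarking "best practices" for avoiding strange performance anomalies like this in the future.
How I'm compiling (assuming Google Benchmark has been installed somewhere it can be found):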
$CC -o mult_test mult_test.cpp -std=c++14 -pthread -O3 -fno-omit-frame-pointer -lbenchmark
I've been running this with:
./mult_test --benchmark_repetitions=5
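I'm doing all tests on Ubuntu 18.04 x64 (kernel version 4.15.0-30-generic).
I've tried several different variants of this code, which all give the same basic results across multiple runs (it's surprising to me how consistent the results are):
- Moved the allocation/initialization into the benchmark "SetUp" phase (the untimed part), so that allocation/deallocation happens for every new sample point
- Switched compilers between GCC 7.3.0 and Clang 6.0.0
- Tried different computers with different CPUs (an Intel i5-6600K, and a dual-socket Xeon E5-2630 v2)
- Tried different ways of implementing the benchmark framework (i.e. not using Google Benchmark at all and implementing the timing manually via std::chrono)
- Forced all the buffers to be aligned to several different boundaries (64 bytes, 128 bytes, 256 bytes)
- Forced a fixed number of iterations in each sample timing period
- Tried running with a higher number of repetitions (20+)
- Forced a constant CPU clock frequency using the performance governor
- Tried different compiler flags for optimization options (removed no-omit-frame-pointer, tried -march=native)
- Tried using std::vector to manage storage, new[]/delete[] pairs, and malloc/free. They all give similar results.
I compared the assembly of the hot section of the code, and it is identical between the two test cases (perf output from one of the cases):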
40:
    mov    0xc0(%r15),%rcx
    mov    0xd0(%r15),%rdx
    add    $0x8,%rcx
    mov    0xc8(%r15),%r9
    add    %r8,%r9
    xor    %r10d,%r10d
    nop
60:
    mov    %r10,%r11
    shl    $0x7,%r11
    mov    %r9,%r13
    xor    %esi,%esi
    nop
70:
    lea    (%rsi,%r11,1),%rax
    movq   $0x0,(%rdx,%rax,8)
    xorpd  %xmm0,%xmm0
    mov    $0xffffffffffffff80,%rdi
    mov    %r13,%rbx
    nop
90:
    movsd  0x3f8(%rcx,%rdi,8),%xmm1
    mulsd  -0x400(%rbx),%xmm1
    addsd  %xmm0,%xmm1
    movsd  %xmm1,(%rdx,%rax,8)
    movsd  0x400(%rcx,%rdi,8),%xmm0
    mulsd  (%rbx),%xmm0
    addsd  %xmm1,%xmm0
    movsd  %xmm0,(%rdx,%rax,8)
    add    $0x800,%rbx
    add    $0x2,%rdi
    jne    90
    add    $0x1,%rsi
    add    $0x8,%r13
    cmp    $0x80,%rsi
    jne    70
    add    $0x1,%r10
    add    $0x400,%rcx
    cmp    $0x80,%r10
    jne    60
    add    $0xffffffffffffffff,%r12
    jne    40
Here is representative perf stat output without the re-allocation:
Running ./mult_test
Run on (4 X 4200 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 256K (x4)
L3 Unified 6144K (x1)
----------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------
mat_bench_fixture/impl1 2181531 ns 2180896 ns 322
mat_bench_fixture/impl1 2188280 ns 2186860 ns 322
mat_bench_fixture/impl1 2182988 ns 2182150 ns 322
mat_bench_fixture/impl1 2182715 ns 2182025 ns 322
mat_bench_fixture/impl1 2175719 ns 2175653 ns 322
mat_bench_fixture/impl1_mean 2182246 ns 2181517 ns 322
mat_bench_fixture/impl1_median 2182715 ns 2182025 ns 322
mat_bench_fixture/impl1_stddev 4480 ns 4000 ns 322
Performance counter stats for './mult_test --benchmark_repetitions=5':
3771.370173 task-clock (msec) # 0.994 CPUs utilized
223 context-switches # 0.059 K/sec
0 cpu-migrations # 0.000 K/sec
242 page-faults # 0.064 K/sec
15,808,590,474 cycles # 4.192 GHz (61.31%)
20,201,201,797 instructions # 1.28 insn per cycle (69.04%)
1,844,097,332 branches # 488.973 M/sec (69.04%)
358,319 branch-misses # 0.02% of all branches (69.14%)
7,232,957,363 L1-dcache-loads # 1917.859 M/sec (69.24%)
3,774,591,187 L1-dcache-load-misses # 52.19% of all L1-dcache hits (69.35%)
558,507,528 LLC-loads # 148.091 M/sec (69.46%)
93,136 LLC-load-misses # 0.02% of all LL-cache hits (69.47%)
<not supported> L1-icache-loads
736,008 L1-icache-load-misses (69.47%)
7,242,324,412 dTLB-loads # 1920.343 M/sec (69.34%)
581 dTLB-load-misses # 0.00% of all dTLB cache hits (61.50%)
1,582 iTLB-loads # 0.419 K/sec (61.39%)
307 iTLB-load-misses # 19.41% of all iTLB cache hits (61.29%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
3.795924436 seconds time elapsed
And here is representative perf stat output with the forced re-allocation:
Running ./mult_test
Run on (4 X 4200 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 256K (x4)
L3 Unified 6144K (x1)
----------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------
mat_bench_fixture/impl1 1862961 ns 1862919 ns 376
mat_bench_fixture/impl1 1861986 ns 1861947 ns 376
mat_bench_fixture/impl1 1860330 ns 1860305 ns 376
mat_bench_fixture/impl1 1859711 ns 1859652 ns 376
mat_bench_fixture/impl1 1863299 ns 1863273 ns 376
mat_bench_fixture/impl1_mean 1861658 ns 1861619 ns 376
mat_bench_fixture/impl1_median 1861986 ns 1861947 ns 376
mat_bench_fixture/impl1_stddev 1585 ns 1591 ns 376
Performance counter stats for './mult_test --benchmark_repetitions=5':
3724.287293 task-clock (msec) # 0.995 CPUs utilized
11 context-switches # 0.003 K/sec
0 cpu-migrations # 0.000 K/sec
246 page-faults # 0.066 K/sec
15,612,924,579 cycles # 4.192 GHz (61.34%)
23,344,859,019 instructions # 1.50 insn per cycle (69.07%)
2,130,528,330 branches # 572.063 M/sec (69.07%)
331,651 branch-misses # 0.02% of all branches (69.08%)
8,369,233,786 L1-dcache-loads # 2247.204 M/sec (69.18%)
4,206,241,296 L1-dcache-load-misses # 50.26% of all L1-dcache hits (69.29%)
308,687,646 LLC-loads # 82.885 M/sec (69.40%)
94,288 LLC-load-misses # 0.03% of all LL-cache hits (69.50%)
<not supported> L1-icache-loads
475,066 L1-icache-load-misses (69.50%)
8,360,570,315 dTLB-loads # 2244.878 M/sec (69.37%)
364 dTLB-load-misses # 0.00% of all dTLB cache hits (61.53%)
213 iTLB-loads # 0.057 K/sec (61.42%)
144 iTLB-load-misses # 67.61% of all iTLB cache hits (61.32%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
3.743017809 seconds time elapsed
Here's a minimal working example which has no external dependencies and allows testing memory alignment issues:
#include <random>
#include <chrono>
#include <iostream>
#include <cstdlib>

constexpr long long n = 128;
constexpr size_t alignment = 64;

inline void escape(void* p)
{
    asm volatile("" : : "g"(p) : "memory");
}

inline void clobber()
{
    asm volatile("" : : : "memory");
}

struct mat_bench_fixture
{
    double *matA, *matB, *matC;

    mat_bench_fixture()
    {
        matA = (double*) aligned_alloc(alignment, sizeof(double) * n * n);
        matB = (double*) aligned_alloc(alignment, sizeof(double) * n * n);
        matC = (double*) aligned_alloc(alignment, sizeof(double) * n * n);
        escape(matA);
        escape(matB);
        escape(matC);
#if 0
        free(matA);
        free(matB);
        free(matC);
        escape(matA);
        escape(matB);
        escape(matC);
        matA = (double*) aligned_alloc(alignment, sizeof(double) * n * n);
        matB = (double*) aligned_alloc(alignment, sizeof(double) * n * n);
        matC = (double*) aligned_alloc(alignment, sizeof(double) * n * n);
        escape(matA);
        escape(matB);
        escape(matC);
#endif
    }

    ~mat_bench_fixture()
    {
        free(matA);
        free(matB);
        free(matC);
    }

    void SetUp()
    {
        // generate random data
        std::mt19937 gen;
        std::uniform_real_distribution<double> dis(0, 1);
        for (double* i = matA; i != matA + n * n; ++i)
        {
            *i = dis(gen);
        }
        for (double* i = matB; i != matB + n * n; ++i)
        {
            *i = dis(gen);
        }
    }

    void run()
    {
        constexpr int iters = 400;
        std::chrono::high_resolution_clock timer;
        auto start = timer.now();
        for (int i = 0; i < iters; ++i)
        {
            for (long long row = 0; row < n; ++row)
            {
                for (long long col = 0; col < n; ++col)
                {
                    matC[row * n + col] = 0;
                    for (long long k = 0; k < n; ++k)
                    {
                        matC[row * n + col] += matA[row * n + k] * matB[k * n + col];
                    }
                }
            }
            escape(matA);
            escape(matB);
            escape(matC);
            clobber();
        }
        auto stop = timer.now();
        // average time per multiplication, in nanoseconds
        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(
                         stop - start)
                             .count() /
                         iters
                  << std::endl;
    }
};

int main()
{
    mat_bench_fixture bench;
    for (int i = 0; i < 5; ++i)
    {
        bench.SetUp();
        bench.run();
    }
}
Compile:
g++ -o mult_test mult_test.cpp -std=c++14 -O3
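Since the standalone example is meant for testing alignment effects, it can help to log where each buffer actually lands in the two scenarios. The following is only a sketch: `page_offset` and `report` are hypothetical helpers that are not part of the original code, and the 4 KiB page size is an assumption.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper: offset of p within its page.
// Assumes 4096-byte pages, which is an assumption about the system.
inline std::uintptr_t page_offset(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) & 0xFFFu;
}

// Hypothetical helper: print the address of one buffer and its page offset.
inline void report(const char* name, const void* p)
{
    std::printf("%s = %p (page offset 0x%03llx)\n", name,
                const_cast<void*>(p),
                static_cast<unsigned long long>(page_offset(p)));
}
```

Calling `report("matA", matA)` (and likewise for `matB` and `matC`) at the end of the constructor, once with `#if 0` and once with `#if 1`, shows whether the re-allocation changed the relative placement of the three buffers.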
On my machine, I can reproduce your case by using different alignments for the pointers. Try this code:
mat_bench_fixture() {
    // over-allocate so there is room to realign (256 doubles = 2048 bytes of slack)
    matA = new double[n * n + 256];
    matB = new double[n * n + 256];
    matC = new double[n * n + 256];

    // align pointers to 1024
    // NOTE: the adjusted pointers must not be passed to delete[];
    // keep the originals if the buffers need to be freed.
    matA = reinterpret_cast<double*>((reinterpret_cast<unsigned long long>(matA) + 1023)&~1023);
    matB = reinterpret_cast<double*>((reinterpret_cast<unsigned long long>(matB) + 1023)&~1023);
    matC = reinterpret_cast<double*>((reinterpret_cast<unsigned long long>(matC) + 1023)&~1023);

    // toggle this to toggle alignment offset of matB
    // matB += 2;
}
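If I toggle the commented line in this code, I get a 34% difference on my machine.
Different alignment offsets cause different timings. You can try offsetting the other 2 pointers as well. Sometimes the difference is smaller, sometimes larger, and sometimes there is no change.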
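To try several offsets without editing and recompiling each time, the experiment can be wrapped in a function. This is a rough sketch under stated assumptions, not the code that was measured above: `align_up` and `time_with_offset` are hypothetical names, the buffers are filled with ones instead of random data, and `escape` uses GCC/Clang inline assembly in the same style as the question.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t N = 128;  // same matrix size as in the question

// Round p up to the next multiple of `align` bytes (align must be a power of two).
inline double* align_up(double* p, std::uintptr_t align)
{
    const auto v = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<double*>((v + align - 1) & ~(align - 1));
}

// Compiler barrier in the style of the question's escape(); GCC/Clang only.
inline void escape(void* p)
{
    asm volatile("" : : "g"(p) : "memory");
}

// Hypothetical helper: run `iters` naive multiplies with matB shifted by
// `offset` doubles past a 1024-byte boundary; returns nanoseconds per multiply.
inline double time_with_offset(std::size_t offset, int iters)
{
    constexpr std::size_t pad = 256;  // slack in doubles, as in the snippet above
    assert(iters > 0);
    assert(offset <= 128);  // alignment consumes at most 127 doubles of the slack
    std::vector<double> sa(N * N + pad, 1.0), sb(N * N + pad, 1.0), sc(N * N + pad, 0.0);
    double* a = align_up(sa.data(), 1024);
    double* b = align_up(sb.data(), 1024) + offset;
    double* c = align_up(sc.data(), 1024);

    const auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < iters; ++it)
    {
        for (std::size_t row = 0; row < N; ++row)
            for (std::size_t col = 0; col < N; ++col)
            {
                c[row * N + col] = 0;
                for (std::size_t k = 0; k < N; ++k)
                    c[row * N + col] += a[row * N + k] * b[k * N + col];
            }
        escape(c);  // keep the result observable so the loops are not removed
    }
    const auto stop = std::chrono::steady_clock::now();

    assert(c[0] == static_cast<double>(N));  // all-ones inputs: each entry sums to N
    return std::chrono::duration<double, std::nano>(stop - start).count() / iters;
}
```

Calling `time_with_offset(off, 400)` for a handful of values such as 0, 2, 4 and 8 and printing the results gives one timing per offset. The same pattern can be applied to the other two pointers, and the numbers will vary from machine to machine.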
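This must be caused by a cache issue: because the low bits of the pointers differ, different collision patterns occur in the cache. And since your routine is memory intensive (the data does not all fit into L1), cache performance matters a lot.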
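To make "collision patterns" a little more concrete, here is a small sketch that maps byte addresses to L1 data cache sets. The geometry is an assumption: the perf output only reports a 32K L1 data cache, and the 8-way associativity and 64-byte line size used here are typical values that should be checked for the CPU in question. `l1_set` and `sets_touched_by_column` are hypothetical helper names.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Assumed L1D geometry (only the 32 KiB size comes from the post):
// 32 KiB, 8-way set associative, 64-byte lines -> 64 sets.
constexpr std::size_t kLineBytes = 64;
constexpr std::size_t kWays = 8;
constexpr std::size_t kCacheBytes = 32 * 1024;
constexpr std::size_t kSets = kCacheBytes / (kLineBytes * kWays);

// L1 set index selected by a byte address (address bits 6 to 11 here).
inline std::size_t l1_set(std::uintptr_t addr)
{
    return (addr / kLineBytes) % kSets;
}

// Number of distinct L1 sets touched when walking one column of an
// n x n row-major matrix of doubles that starts at byte address `base`.
inline std::size_t sets_touched_by_column(std::uintptr_t base, std::size_t n)
{
    std::set<std::size_t> sets;
    for (std::size_t k = 0; k < n; ++k)
        sets.insert(l1_set(base + k * n * sizeof(double)));
    return sets.size();
}
```

Under these assumptions, `sets_touched_by_column(base, 128)` is 4 for any `base`: the 1024-byte row stride makes a column walk of matB reuse the same 4 sets, which hold only 32 lines while the column spans 128. That would be consistent with the roughly 50% L1 load miss rate in the perf output. Which sets those are, and how they overlap with the lines of matA and matC in use at the same time, depends on bits 6 to 11 of each buffer's base address.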