std::shared_ptr vs std::make_shared: unexpected cache misses and branch prediction
I am trying to measure the efficiency of pointers created with the std::shared_ptr constructor versus std::make_shared. Here is my test code:
#include <cstdlib>   // atoi
#include <iostream>
#include <memory>
#include <vector>

struct TestClass {
    TestClass(int _i) : i(_i) {}
    int i = 1;
};

void sum(const std::vector<std::shared_ptr<TestClass>>& v) {
    unsigned long long s = 0u;
    for(size_t i = 0; i < v.size() - 1; ++i) {
        s += v[i]->i * v[i + 1]->i;
    }
    std::cout << s << '\n';
}

void test_shared_ptr(size_t n) {
    std::cout << __FUNCTION__ << "\n";
    std::vector<std::shared_ptr<TestClass>> v;
    v.reserve(n);
    for(size_t i = 0u; i < n; ++i) {
        v.push_back(std::shared_ptr<TestClass>(new TestClass(i)));
    }
    sum(v);
}

void test_make_shared(size_t n) {
    std::cout << __FUNCTION__ << "\n";
    std::vector<std::shared_ptr<TestClass>> v;
    v.reserve(n);
    for(size_t i = 0u; i < n; ++i) {
        v.push_back(std::make_shared<TestClass>(i));
    }
    sum(v);
}

int main(int argc, char *argv[]) {
    size_t n = (argc == 3) ? atoi(argv[2]) : 100;
    if(atoi(argv[1]) == 1) {   // mode 1: shared_ptr(new T); otherwise make_shared
        test_shared_ptr(n);
    } else {
        test_make_shared(n);
    }
    return 0;
}
Compiled with: g++ -W -Wall -O2 -g -std=c++14 main.cpp -o cache_misses.bin
I run the test that uses the std::shared_ptr constructor and check the result with valgrind:
valgrind --tool=cachegrind --branch-sim=yes ./cache_misses.bin 1 100000
==2005== Cachegrind, a cache and branch-prediction profiler
==2005== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2005== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==2005== Command: ./cache_misses.bin 1 100000
==2005==
--2005-- warning: L3 cache found, using its data for the LL simulation.
--2005-- warning: specified LL cache: line_size 64 assoc 12 total_size 9,437,184
--2005-- warning: simulated LL cache: line_size 64 assoc 18 total_size 9,437,184
test_shared_ptr
18107093611968
==2005==
==2005== I refs: 74,188,102
==2005== I1 misses: 1,806
==2005== LLi misses: 1,696
==2005== I1 miss rate: 0.00%
==2005== LLi miss rate: 0.00%
==2005==
==2005== D refs: 26,099,141 (15,735,722 rd + 10,363,419 wr)
==2005== D1 misses: 392,064 ( 264,583 rd + 127,481 wr)
==2005== LLd misses: 134,416 ( 7,947 rd + 126,469 wr)
==2005== D1 miss rate: 1.5% ( 1.7% + 1.2% )
==2005== LLd miss rate: 0.5% ( 0.1% + 1.2% )
==2005==
==2005== LL refs: 393,870 ( 266,389 rd + 127,481 wr)
==2005== LL misses: 136,112 ( 9,643 rd + 126,469 wr)
==2005== LL miss rate: 0.1% ( 0.0% + 1.2% )
==2005==
==2005== Branches: 12,732,402 (11,526,905 cond + 1,205,497 ind)
==2005== Mispredicts: 16,055 ( 15,481 cond + 574 ind)
==2005== Mispred rate: 0.1% ( 0.1% + 0.0% )
And with std::make_shared:
valgrind --tool=cachegrind --branch-sim=yes ./cache_misses.bin 2 100000
==2014== Cachegrind, a cache and branch-prediction profiler
==2014== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2014== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==2014== Command: ./cache_misses.bin 2 100000
==2014==
--2014-- warning: L3 cache found, using its data for the LL simulation.
--2014-- warning: specified LL cache: line_size 64 assoc 12 total_size 9,437,184
--2014-- warning: simulated LL cache: line_size 64 assoc 18 total_size 9,437,184
test_make_shared
18107093611968
==2014==
==2014== I refs: 41,283,983
==2014== I1 misses: 1,805
==2014== LLi misses: 1,696
==2014== I1 miss rate: 0.00%
==2014== LLi miss rate: 0.00%
==2014==
==2014== D refs: 14,997,474 (8,834,690 rd + 6,162,784 wr)
==2014== D1 misses: 241,781 ( 164,368 rd + 77,413 wr)
==2014== LLd misses: 84,413 ( 7,943 rd + 76,470 wr)
==2014== D1 miss rate: 1.6% ( 1.9% + 1.3% )
==2014== LLd miss rate: 0.6% ( 0.1% + 1.2% )
==2014==
==2014== LL refs: 243,586 ( 166,173 rd + 77,413 wr)
==2014== LL misses: 86,109 ( 9,639 rd + 76,470 wr)
==2014== LL miss rate: 0.2% ( 0.0% + 1.2% )
==2014==
==2014== Branches: 7,031,695 (6,426,222 cond + 605,473 ind)
==2014== Mispredicts: 216,010 ( 15,442 cond + 200,568 ind)
==2014== Mispred rate: 3.1% ( 0.2% + 33.1% )
As you can see, both the cache miss rate and the branch mispredict rate are higher when I use std::make_shared. I expected std::make_shared to be more efficient, because the stored object and the control block are placed in a single memory block, or at least for the performance to be the same. What am I missing?
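For reference, here is a minimal sketch (my own, separate from the test above; the counter name is illustrative) that counts heap allocations with a replacement global operator new. It confirms that std::make_shared performs one allocation per element while shared_ptr<TestClass>(new TestClass(i)) performs two:

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <memory>
#include <new>

static std::size_t g_allocations = 0;   // counts every call to the global operator new

void* operator new(std::size_t size) {
    ++g_allocations;
    if(void* p = std::malloc(size)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

struct TestClass {
    TestClass(int _i) : i(_i) {}
    int i = 1;
};

int main() {
    g_allocations = 0;
    { auto p = std::shared_ptr<TestClass>(new TestClass(1)); }
    std::printf("shared_ptr(new T): %zu allocations\n", g_allocations);   // 2: object + control block

    g_allocations = 0;
    { auto p = std::make_shared<TestClass>(1); }
    std::printf("make_shared<T>:    %zu allocations\n", g_allocations);   // 1: single combined block
    return 0;
}

So the single-allocation expectation itself holds; the question is only about what the simulated cache and branch counters report.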
Environment details:
$ g++ --version
g++ (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Isn't cachegrind a simulation rather than a measurement? https://valgrind.org/docs/manual/cg-manual.html#branch-sim says that conditional branches are predicted using an array of 16384 2-bit saturating counters, and that this is meant to be representative of a typical desktop/server of 2004.
Simple branch prediction with 2-bit saturating counters is a joke by modern standards, and it was overly simplistic for high-performance CPUs even in 2004; see https://danluu.com/branch-prediction/ and https://agner.org/optimize/. Pentium II/III already had a two-level adaptive local/global predictor with 4 bits per local-history entry, and Agner's microarchitecture PDF has a chapter on branch prediction near the beginning.
Intel has used IT-TAGE since Haswell, and modern AMD also uses advanced branch-prediction techniques.
I would not be surprised if a few of your branches happen to alias each other in valgrind's simulation, causing mispredicts for the branches that run less often.
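To make that concrete, here is a minimal sketch (my own simplification, not Cachegrind's actual code) of a table of 16384 2-bit saturating counters, indexed here only by the branch address; per the linked manual, the real simulator also folds recent branch history into the index. A hot always-taken branch that shares a counter with a cold never-taken branch trains the counter the wrong way, so the cold branch is mispredicted almost every time:

#include <array>
#include <cstdint>
#include <cstdio>

struct TwoBitPredictor {
    // 16384 counters, zero-initialized; 0/1 predict not-taken, 2/3 predict taken.
    std::array<std::uint8_t, 16384> table{};

    bool predict(std::uintptr_t addr) const {
        return table[addr % table.size()] >= 2;
    }
    void update(std::uintptr_t addr, bool taken) {
        std::uint8_t& c = table[addr % table.size()];
        if(taken) { if(c < 3) ++c; }
        else      { if(c > 0) --c; }
    }
};

int main() {
    TwoBitPredictor p;
    // Two hypothetical branch addresses that alias to the same counter slot.
    const std::uintptr_t hot  = 0x1000;
    const std::uintptr_t cold = 0x1000 + 16384;
    int mispredicts = 0;
    for(int i = 0; i < 1000; ++i) {
        for(int j = 0; j < 10; ++j) p.update(hot, true);   // hot branch: taken often
        if(p.predict(cold) != false) ++mispredicts;        // cold branch: never taken
        p.update(cold, false);
    }
    std::printf("cold-branch mispredicts: %d / 1000\n", mispredicts);   // expect ~1000
    return 0;
}

The indirect-branch mispredicts that dominate your make_shared run come from a separate, similarly simple simulated structure (a small branch target buffer), which is just as prone to this kind of aliasing.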
Have you tried real hardware performance counters? For example, on Linux:
perf stat -d ./cache_misses.bin 2 100000
should give you a much more realistic picture of actual hardware, including real L1d miss rates and branch mispredict rates. perf events such as branches and branch-misses map to specific hardware counters depending on the CPU microarchitecture; perf list will show the counters that are available.
On my Skylake CPU I often use taskset -c 3 perf stat -e task-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,instructions:u,branches:u,branch-misses:u,uops_issued.any:u,uops_executed.thread:u -r 2 ./program_under_test.
(In practice I usually ignore branch misses, because I'm often tuning SIMD loops with no unpredictable branches, and there is only a limited number of hardware counters that can be programmed to count different events.)