了解 Linux 虚拟内存:valgrind 的 massif 输出显示有和没有 --pages-as-heap 的主要差异
Understanding Linux virtual memory: valgrind's massif output shows major differences with and without --pages-as-heap
我看过这个参数的文档,但是差别真的很大!启用时,一个简单程序(见下文)的内存使用量约为 7 GB,禁用时,报告的使用量约为 160 KB .
top
也显示了大约 7 GB,这有点 证实了 pages-as-heap=yes
的结果。
(我有一个理论,但我不相信它能解释如此巨大的差异,所以 - 寻求帮助)。
让我特别困扰的是,大部分报告的内存使用量都被 std::string
使用,而 what?
从未被打印出来(意思是 - 实际容量非常小)。
我确实需要在分析我的应用程序时使用 pages-as-heap=yes
,我只是想知道如何避免 "false positives"
代码片段:
#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
void run()
{
while (true)
{
std::string s;
s += "aaaaa";
s += "aaaaaaaaaaaaaaa";
s += "bbbbbbbbbb";
s += "cccccccccccccccccccccccccccccccccccccccccccccccccc";
if (s.capacity() > 1024) std::cout << "what?" << std::endl;
std::this_thread::sleep_for(std::chrono::seconds(1));
}
}
int main()
{
std::vector<std::thread> workers;
for( unsigned i = 0; i < 192; ++i ) workers.push_back(std::thread(&run));
workers.back().join();
}
编译:g++ --std=c++11 -fno-inline -g3 -pthread
与pages-as-heap=yes
:
100.00% (7,257,714,688B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.75% (7,239,757,824B) 0x54E0679: mmap (mmap.c:34)
| ->53.63% (3,892,314,112B) 0x545C3CF: new_heap (arena.c:438)
| | ->53.63% (3,892,314,112B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->53.63% (3,892,314,112B) 0x5463248: malloc (malloc.c:2911)
| | ->53.63% (3,892,314,112B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF8E37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF9C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF9D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF9FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x401252: run() (test.cpp:11)
| | ->53.63% (3,892,314,112B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| | ->53.63% (3,892,314,112B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| | ->53.63% (3,892,314,112B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| | ->53.63% (3,892,314,112B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->53.63% (3,892,314,112B) 0x54E63DB: clone (clone.S:109)
| |
| ->35.14% (2,550,136,832B) 0x545C35B: new_heap (arena.c:427)
| | ->35.14% (2,550,136,832B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->35.14% (2,550,136,832B) 0x5463248: malloc (malloc.c:2911)
| | ->35.14% (2,550,136,832B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF8E37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF9C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF9D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF9FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x401252: run() (test.cpp:11)
| | ->35.14% (2,550,136,832B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| | ->35.14% (2,550,136,832B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| | ->35.14% (2,550,136,832B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| | ->35.14% (2,550,136,832B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->35.14% (2,550,136,832B) 0x54E63DB: clone (clone.S:109)
| |
| ->10.99% (797,306,880B) 0x51CA1D4: pthread_create@@GLIBC_2.2.5 (allocatestack.c:513)
| ->10.99% (797,306,880B) 0x4CE2DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->10.99% (797,306,880B) 0x4CE2ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->10.99% (797,306,880B) 0x401BEA: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
| ->10.99% (797,306,880B) 0x401353: main (test.cpp:24)
|
->00.25% (17,956,864B) in 1+ places, all below ms_print's threshold (01.00%)
同时 pages-as-heap=no
:
96.38% (159,289B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->43.99% (72,704B) 0x4EBAEFE: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->43.99% (72,704B) 0x40106B8: call_init.part.0 (dl-init.c:72)
| ->43.99% (72,704B) 0x40107C9: _dl_init (dl-init.c:30)
| ->43.99% (72,704B) 0x4000C68: ??? (in /lib/x86_64-linux-gnu/ld-2.23.so)
|
->33.46% (55,296B) 0x40138A3: _dl_allocate_tls (dl-tls.c:322)
| ->33.46% (55,296B) 0x53D126D: pthread_create@@GLIBC_2.2.5 (allocatestack.c:588)
| ->33.46% (55,296B) 0x4EE9DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->33.46% (55,296B) 0x4EE9ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->33.46% (55,296B) 0x401BEA: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
| ->33.46% (55,296B) 0x401353: main (test.cpp:24)
|
->12.12% (20,025B) 0x4EFFE37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.12% (20,025B) 0x4F00C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.12% (20,025B) 0x4F00D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.12% (20,025B) 0x4F00FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.07% (19,950B) 0x401285: run() (test.cpp:14)
| | ->12.07% (19,950B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| | ->12.07% (19,950B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| | ->12.07% (19,950B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| | ->12.07% (19,950B) 0x4EE9C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->12.07% (19,950B) 0x53D06B8: start_thread (pthread_create.c:333)
| | ->12.07% (19,950B) 0x56ED3DB: clone (clone.S:109)
| |
| ->00.05% (75B) in 1+ places, all below ms_print's threshold (01.00%)
|
->05.58% (9,216B) 0x40315B: __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) (new_allocator.h:104)
| ->05.58% (9,216B) 0x402FC2: std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> > >::allocate(std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> >&, unsigned long) (alloc_traits.h:488)
| ->05.58% (9,216B) 0x402D4B: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::thread::_Impl<std::_Bind_simple<void (*())()> >*, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr_base.h:616)
| ->05.58% (9,216B) 0x402BDE: std::__shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr_base.h:1090)
| ->05.58% (9,216B) 0x402A76: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > >::shared_ptr<std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr.h:316)
| ->05.58% (9,216B) 0x402771: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::allocate_shared<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr.h:594)
| ->05.58% (9,216B) 0x402325: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::make_shared<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::_Bind_simple<void (*())()> >(std::_Bind_simple<void (*())()>&&) (shared_ptr.h:610)
| ->05.58% (9,216B) 0x401F9C: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::thread::_M_make_routine<std::_Bind_simple<void (*())()> >(std::_Bind_simple<void (*())()>&&) (thread:196)
| ->05.58% (9,216B) 0x401BC4: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
| ->05.58% (9,216B) 0x401353: main (test.cpp:24)
|
->01.24% (2,048B) 0x402C9A: __gnu_cxx::new_allocator<std::thread>::allocate(unsigned long, void const*) (new_allocator.h:104)
->01.24% (2,048B) 0x402AF5: std::allocator_traits<std::allocator<std::thread> >::allocate(std::allocator<std::thread>&, unsigned long) (alloc_traits.h:488)
->01.24% (2,048B) 0x402928: std::_Vector_base<std::thread, std::allocator<std::thread> >::_M_allocate(unsigned long) (stl_vector.h:170)
->01.24% (2,048B) 0x40244E: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<std::thread>(std::thread&&) (vector.tcc:412)
->01.24% (2,048B) 0x40206D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<std::thread>(std::thread&&) (vector.tcc:101)
->01.24% (2,048B) 0x401C82: std::vector<std::thread, std::allocator<std::thread> >::push_back(std::thread&&) (stl_vector.h:932)
->01.24% (2,048B) 0x401366: main (test.cpp:24)
请忽略线程的蹩脚处理,这只是一个非常简短的例子。
更新
看来,这与 std::string
完全无关。正如@Lawrence 所建议的,这可以通过简单地在堆上分配一个 int
(使用 new
)来重现。我相信 @Lawrence 非常接近这里的真实答案,引用他的评论(对更多读者来说更容易):
劳伦斯:
@KirilKirov The string allocation is not actually taking that much
space... Each thread gets it's initial stack and then heap access maps
some large amount of space (around 70m) that gets inaccurately
reflected. You can measure it by just declaring 1 string and then
having a spin loop... the same virtual memory usage is shown –
Lawrence Sep 28 at 14:51
我:
@Lawrence - you're damn right! OK, so, you're saying (and it appears
to be like this), that on each thread, on the first heap allocation,
the memory manager (or the OS, or whatever) dedicates huge chunk of
memory for the threads' heap needs? And this chunk will be reused
later (or shrinked, if necessary)? – Kiril Kirov Sep 28 at 15:45
劳伦斯:
@KirilKirov something of that nature... exact allocations probably depends on malloc implementation and whatnot – Lawrence 2 days ago
这只是一个 "kind of" 答案(从 Valgrind 的角度来看)。内存池的问题,尤其是 C++ 字符串,已经为人所知有一段时间了。 Valgrind manual 有一个关于 C++ 字符串泄漏的部分,建议您尝试设置 GLIBCXX_FORCE_NEW 环境变量。
此外,对于 GCC6 及更高版本,Valgrind 添加了挂钩以清理 libstdc++ 中仍可访问的内存。 Valgrind bugzilla 条目是 here and the GCC one is here.
我不明白为什么这么小的分配会爆炸到这么多 GB(64 位可执行文件、CentOS 6.6、GCC 6.2 超过 12 GB)。
massif
和 --pages-as-heap=yes
以及您正在观察的 top
列都测量进程使用的虚拟内存。此虚拟内存包括所有 space mmap
'd 在执行 malloc 和创建线程期间。例如,线程的默认堆栈大小将为 8192k
,这反映在每个线程的创建中并有助于虚拟内存占用。
具体的分配方案将取决于实现,但似乎新线程上的第一个堆分配将 mmap
大约 65 兆字节 space。这可以通过查看进程的 pmap 输出来查看。
摘自与示例非常相似的程序:
75170: ./a.out
0000000000400000 24K r-x-- a.out
0000000000605000 4K r---- a.out
0000000000606000 4K rw--- a.out
0000000001b6a000 200K rw--- [ anon ]
00007f669dfa4000 4K ----- [ anon ]
00007f669dfa5000 8192K rw--- [ anon ]
00007f669e7a5000 4K ----- [ anon ]
00007f669e7a6000 8192K rw--- [ anon ]
00007f669efa6000 4K ----- [ anon ]
00007f669efa7000 8192K rw--- [ anon ]
...
00007f66cb800000 8192K rw--- [ anon ]
00007f66cc000000 132K rw--- [ anon ]
00007f66cc021000 65404K ----- [ anon ]
00007f66d0000000 132K rw--- [ anon ]
00007f66d0021000 65404K ----- [ anon ]
00007f66d4000000 132K rw--- [ anon ]
00007f66d4021000 65404K ----- [ anon ]
...
00007f6880586000 8192K rw--- [ anon ]
00007f6880d86000 1056K r-x-- libm-2.23.so
00007f6880e8e000 2044K ----- libm-2.23.so
...
00007f6881c08000 4K r---- libpthread-2.23.so
00007f6881c09000 4K rw--- libpthread-2.23.so
00007f6881c0a000 16K rw--- [ anon ]
00007f6881c0e000 152K r-x-- ld-2.23.so
00007f6881e09000 24K rw--- [ anon ]
00007f6881e33000 4K r---- ld-2.23.so
00007f6881e34000 4K rw--- ld-2.23.so
00007f6881e35000 4K rw--- [ anon ]
00007ffe9d75b000 132K rw--- [ stack ]
00007ffe9d7f8000 12K r---- [ anon ]
00007ffe9d7fb000 8K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 7815008K
当您接近每个进程的虚拟内存阈值时,malloc 似乎变得更加保守。另外,我关于单独映射库的评论被误导了(它们应该按进程共享)
查看文档:
--pages-as-heap= [default: no]
Tells Massif to profile memory at the page level rather than at the malloc'd block level. See above for details.
因此,根据文档,更改此设置会改变测量的内容;不是分配的。
如果是,则您正在测量页数。
如果否,您正在测量 malloc 块。
我会试着写一个简短的总结我学到的东西,同时试图弄清楚发生了什么。
注意:感谢@Lawrence,这个答案是可能的 - 感谢!
长话短说
这与 Linux/kernel(虚拟)内存管理完全无关,也与 std::string
.
无关
这都是关于 glibc
的内存分配器 - 它只是在第一次(当然不仅仅是)动态分配(每个线程)时分配大量内存区域。
详情
#include <thread>
#include <vector>
#include <chrono>
int main() {
std::vector<std::thread> workers;
for( unsigned i = 0; i < 192; ++i )
workers.emplace_back([]{
const auto x = std::make_unique<int>(rand());
while (true) std::this_thread::sleep_for(std::chrono::seconds(1));});
workers.back().join();
}
请忽略线程的蹩脚处理,我希望它尽可能短。
命令
编译:g++ --std=c++14 -fno-inline -g3 -O0 -pthread test.cpp
.
简介:valgrind --tool=massif --pages-as-heap=[no|yes] ./a.out
内存使用情况
top
显示 7'815'012
KiB 虚拟内存。
pmap
还显示 7'815'016
KiB 虚拟内存。
massif
和 pages-as-heap=yes
显示了类似的结果:7'817'088
KiB,见下文。
另一方面,massif
与 pages-as-heap=no
截然不同——大约 133 KiB!
Massif 输出 pages-as-heap=yes
杀死程序前的内存使用情况:
100.00% (8,004,698,112B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.78% (7,986,741,248B) 0x54E0679: mmap (mmap.c:34)
| ->46.11% (3,690,987,520B) 0x545C3CF: new_heap (arena.c:438)
| | ->46.11% (3,690,987,520B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->46.11% (3,690,987,520B) 0x5463248: malloc (malloc.c:2911)
| | ->46.11% (3,690,987,520B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->46.11% (3,690,987,520B) 0x4026D0: std::_MakeUniq<int>::__single_object std::make_unique<int, int>(int&&) (unique_ptr.h:765)
| | ->46.11% (3,690,987,520B) 0x400EC5: main::{lambda()
| | ->46.11% (3,690,987,520B) 0x40225C: void std::_Bind_simple<main::{lambda()
| | ->46.11% (3,690,987,520B) 0x402194: std::_Bind_simple<main::{lambda()
| | ->46.11% (3,690,987,520B) 0x402102: std::thread::_Impl<std::_Bind_simple<main::{lambda()
| | ->46.11% (3,690,987,520B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->46.11% (3,690,987,520B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->46.11% (3,690,987,520B) 0x54E63DB: clone (clone.S:109)
| |
| ->33.53% (2,684,354,560B) 0x545C35B: new_heap (arena.c:427)
| | ->33.53% (2,684,354,560B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->33.53% (2,684,354,560B) 0x5463248: malloc (malloc.c:2911)
| | ->33.53% (2,684,354,560B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->33.53% (2,684,354,560B) 0x4026D0: std::_MakeUniq<int>::__single_object std::make_unique<int, int>(int&&) (unique_ptr.h:765)
| | ->33.53% (2,684,354,560B) 0x400EC5: main::{lambda()
| | ->33.53% (2,684,354,560B) 0x40225C: void std::_Bind_simple<main::{lambda()
| | ->33.53% (2,684,354,560B) 0x402194: std::_Bind_simple<main::{lambda()
| | ->33.53% (2,684,354,560B) 0x402102: std::thread::_Impl<std::_Bind_simple<main::{lambda()
| | ->33.53% (2,684,354,560B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->33.53% (2,684,354,560B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->33.53% (2,684,354,560B) 0x54E63DB: clone (clone.S:109)
| |
| ->20.13% (1,611,399,168B) 0x51CA1D4: pthread_create@@GLIBC_2.2.5 (allocatestack.c:513)
| ->20.13% (1,611,399,168B) 0x4CE2DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->20.13% (1,611,399,168B) 0x4CE2ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->20.13% (1,611,399,168B) 0x40139A: std::thread::thread<main::{lambda()
| ->20.13% (1,611,399,168B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
| ->20.13% (1,611,399,168B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
| ->19.19% (1,535,864,832B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| | ->19.19% (1,535,864,832B) 0x400F47: main (test.cpp:10)
| |
| ->00.94% (75,534,336B) in 1+ places, all below ms_print's threshold (01.00%)
|
->00.22% (17,956,864B) in 1+ places, all below ms_print's threshold (01.00%)
Massif 输出 pages-as-heap=no
杀死程序前的内存使用情况:
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
68 2,793,125 143,280 136,676 6,604 0
95.39% (136,676B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->50.74% (72,704B) 0x4EBAEFE: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->50.74% (72,704B) 0x40106B8: call_init.part.0 (dl-init.c:72)
| ->50.74% (72,704B) 0x40107C9: _dl_init (dl-init.c:30)
| ->50.74% (72,704B) 0x4000C68: ??? (in /lib/x86_64-linux-gnu/ld-2.23.so)
|
->36.58% (52,416B) 0x40138A3: _dl_allocate_tls (dl-tls.c:322)
| ->36.58% (52,416B) 0x53D126D: pthread_create@@GLIBC_2.2.5 (allocatestack.c:588)
| ->36.58% (52,416B) 0x4EE9DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->36.58% (52,416B) 0x4EE9ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->36.58% (52,416B) 0x40139A: std::thread::thread<main::{lambda()
| ->36.58% (52,416B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
| ->36.58% (52,416B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
| ->34.77% (49,824B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| | ->34.77% (49,824B) 0x400F47: main (test.cpp:10)
| |
| ->01.81% (2,592B) 0x4010FF: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<main::{lambda()
| ->01.81% (2,592B) 0x40103D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| ->01.81% (2,592B) 0x400F47: main (test.cpp:10)
|
->06.13% (8,784B) 0x401B4B: __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x401A60: std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x40194D: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x401894: std::__shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x40183A: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x4017C7: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x4016AB: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x40155E: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x401374: std::thread::thread<main::{lambda()
| ->06.13% (8,784B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
| ->06.13% (8,784B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
| ->05.83% (8,352B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| | ->05.83% (8,352B) 0x400F47: main (test.cpp:10)
| |
| ->00.30% (432B) in 1+ places, all below ms_print's threshold (01.00%)
|
->01.43% (2,048B) 0x403432: __gnu_cxx::new_allocator<std::thread>::allocate(unsigned long, void const*) (new_allocator.h:104)
| ->01.43% (2,048B) 0x4032CF: std::allocator_traits<std::allocator<std::thread> >::allocate(std::allocator<std::thread>&, unsigned long) (alloc_traits.h:488)
| ->01.43% (2,048B) 0x4030B8: std::_Vector_base<std::thread, std::allocator<std::thread> >::_M_allocate(unsigned long) (stl_vector.h:170)
| ->01.43% (2,048B) 0x4010B6: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<main::{lambda()
| ->01.43% (2,048B) 0x40103D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| ->01.43% (2,048B) 0x400F47: main (test.cpp:10)
|
->00.51% (724B) in 1+ places, all below ms_print's threshold (01.00%)
发生了什么怪事?
pages-as-heap=no
在 pages-as-heap=no
中,事情看起来很合理 - 让我们不检查它。正如预期的那样,一切都以 malloc/new/new[]
结束,内存使用量很小,我们不用担心 - 这些是高级分配。
pages-as-heap=yes
但看到pages-as-heap=yes
? ~8GiB 虚拟内存用这个简单的代码?
让我们检查堆栈跟踪。
pthread_create
让我们从更简单的开始:以 pthread_create
结尾的那个。
massif
报告分配的内存 1,611,399,168
字节 - 这正好是 192 * 8'196 KiB,意思是 - 192 个线程 * 8MiB,which is the default max stack size of a thread in Linux。
注意,8'196 KiB 不完全是 8 MiB (8'192 KiB)。我不知道这种差异从何而来,但目前并不显着。
std::make_unique<int>
好的,现在让我们看看另外两个堆栈...等等,它们完全一样?是的,massif
的文档解释了这个,我没有完全理解它,但它也不重要。它们显示完全相同的堆栈。让我们结合结果一起检验。
这两个堆栈的内存使用量加起来是 6'375'342'080
字节,所有这些都是由我们简单的 std::make_unique<int>
!
引起的
让我们退后一步:如果我们 运行 同样的实验,但使用一个简单的线程,我们将看到,这个 int
分配导致分配 67'108'864
字节的内存,正好是 64 MB。会发生什么??
这一切都归结为 malloc
的实现(众所周知,new/new[]
在内部默认使用 malloc
.. 实现)。
malloc
内部使用内存分配器,称为 ptmalloc2
- Linux 中的默认内存分配器,它支持线程。
简单地说,这个分配器处理以下条款:
per thread arena
:巨大的内存区域;出于性能原因,通常是每个线程;并非所有软件线程都有自己的 per-thread-arenas,这通常取决于硬件线程的数量(我猜还有其他细节);
heap
:arena
分堆;
chunks
:heap
被分成更小的内存区域,称为chunks
。
关于这些东西有很多细节,稍后会 post 一些有趣的链接,虽然这应该足以让 reader 自己研究 - 这些真的很低-层次和深层次的东西,跟C++内存管理有关。
那么,让我们回到我们的单线程测试 - 为单个 int
分配 64 MiB?让我们再次查看堆栈跟踪并集中在它的末尾:
mmap (mmap.c:34)
new_heap (arena.c:438)
arena_get2.part.3 (arena.c:646)
malloc (malloc.c:2911)
惊喜,惊喜:malloc
调用arena_get2
,调用new_heap
,这导致我们mmap
(mmap
和brk
是低级系统调用,用于 Linux 中的内存分配)。据报道,这恰好分配了 64 MiB 内存。
好的,现在让我们回到我们最初的例子,有 192 个线程和我们巨大的数字 6'375'342'080
- 这 正好 95 * 64 MiB!
为什么是 95 - 我真的不能说,我停止挖掘了,但事实上,这个大数字可以整除 64 MiB 对我来说已经足够了。
如有必要,您可以深入挖掘。
有用的链接
非常棒的解释性文章:Understanding glibc malloc,作者:sploitfun
更多 formal/official 文档:The GNU allocator
一个很酷的堆栈交换问题:How does glibc malloc works
其他:
- https://zgqallen.github.io/2017/08/03/linux-glic-mm-overview/
- https://en.wikipedia.org/wiki/C_dynamic_memory_allocation
- https://www.blackhat.com/presentations/bh-usa-07/Ferguson/Whitepaper/bh-usa-07-ferguson-WP.pdf
- heap management
如果在阅读本文时其中一些链接已损坏 post,则应该很容易找到类似的文章。这个话题很受欢迎,如果你知道要寻找什么以及如何寻找。
谢谢
我希望这些观察能够对整个画面进行良好的高级描述,并为进一步的扩展研究提供足够的食物。
随时评论/(建议)编辑/更正/扩展/等
我看过这个参数的文档,但是差别真的很大!启用时,一个简单程序(见下文)的内存使用量约为 7 GB,禁用时,报告的使用量约为 160 KB .
top
也显示了大约 7 GB,这有点 证实了 pages-as-heap=yes
的结果。
(我有一个理论,但我不相信它能解释如此巨大的差异,所以 - 寻求帮助)。
让我特别困扰的是,大部分报告的内存使用量都被 std::string
使用,而 what?
从未被打印出来(意思是 - 实际容量非常小)。
我确实需要在分析我的应用程序时使用 pages-as-heap=yes
,我只是想知道如何避免 "false positives"
代码片段:
#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
void run()
{
while (true)
{
std::string s;
s += "aaaaa";
s += "aaaaaaaaaaaaaaa";
s += "bbbbbbbbbb";
s += "cccccccccccccccccccccccccccccccccccccccccccccccccc";
if (s.capacity() > 1024) std::cout << "what?" << std::endl;
std::this_thread::sleep_for(std::chrono::seconds(1));
}
}
int main()
{
std::vector<std::thread> workers;
for( unsigned i = 0; i < 192; ++i ) workers.push_back(std::thread(&run));
workers.back().join();
}
编译:g++ --std=c++11 -fno-inline -g3 -pthread
与pages-as-heap=yes
:
100.00% (7,257,714,688B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.75% (7,239,757,824B) 0x54E0679: mmap (mmap.c:34)
| ->53.63% (3,892,314,112B) 0x545C3CF: new_heap (arena.c:438)
| | ->53.63% (3,892,314,112B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->53.63% (3,892,314,112B) 0x5463248: malloc (malloc.c:2911)
| | ->53.63% (3,892,314,112B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF8E37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF9C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF9D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF9FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x401252: run() (test.cpp:11)
| | ->53.63% (3,892,314,112B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| | ->53.63% (3,892,314,112B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| | ->53.63% (3,892,314,112B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| | ->53.63% (3,892,314,112B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->53.63% (3,892,314,112B) 0x54E63DB: clone (clone.S:109)
| |
| ->35.14% (2,550,136,832B) 0x545C35B: new_heap (arena.c:427)
| | ->35.14% (2,550,136,832B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->35.14% (2,550,136,832B) 0x5463248: malloc (malloc.c:2911)
| | ->35.14% (2,550,136,832B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF8E37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF9C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF9D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF9FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x401252: run() (test.cpp:11)
| | ->35.14% (2,550,136,832B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| | ->35.14% (2,550,136,832B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| | ->35.14% (2,550,136,832B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| | ->35.14% (2,550,136,832B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->35.14% (2,550,136,832B) 0x54E63DB: clone (clone.S:109)
| |
| ->10.99% (797,306,880B) 0x51CA1D4: pthread_create@@GLIBC_2.2.5 (allocatestack.c:513)
| ->10.99% (797,306,880B) 0x4CE2DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->10.99% (797,306,880B) 0x4CE2ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->10.99% (797,306,880B) 0x401BEA: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
| ->10.99% (797,306,880B) 0x401353: main (test.cpp:24)
|
->00.25% (17,956,864B) in 1+ places, all below ms_print's threshold (01.00%)
同时 pages-as-heap=no
:
96.38% (159,289B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->43.99% (72,704B) 0x4EBAEFE: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->43.99% (72,704B) 0x40106B8: call_init.part.0 (dl-init.c:72)
| ->43.99% (72,704B) 0x40107C9: _dl_init (dl-init.c:30)
| ->43.99% (72,704B) 0x4000C68: ??? (in /lib/x86_64-linux-gnu/ld-2.23.so)
|
->33.46% (55,296B) 0x40138A3: _dl_allocate_tls (dl-tls.c:322)
| ->33.46% (55,296B) 0x53D126D: pthread_create@@GLIBC_2.2.5 (allocatestack.c:588)
| ->33.46% (55,296B) 0x4EE9DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->33.46% (55,296B) 0x4EE9ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->33.46% (55,296B) 0x401BEA: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
| ->33.46% (55,296B) 0x401353: main (test.cpp:24)
|
->12.12% (20,025B) 0x4EFFE37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.12% (20,025B) 0x4F00C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.12% (20,025B) 0x4F00D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.12% (20,025B) 0x4F00FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.07% (19,950B) 0x401285: run() (test.cpp:14)
| | ->12.07% (19,950B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| | ->12.07% (19,950B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| | ->12.07% (19,950B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| | ->12.07% (19,950B) 0x4EE9C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->12.07% (19,950B) 0x53D06B8: start_thread (pthread_create.c:333)
| | ->12.07% (19,950B) 0x56ED3DB: clone (clone.S:109)
| |
| ->00.05% (75B) in 1+ places, all below ms_print's threshold (01.00%)
|
->05.58% (9,216B) 0x40315B: __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) (new_allocator.h:104)
| ->05.58% (9,216B) 0x402FC2: std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> > >::allocate(std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> >&, unsigned long) (alloc_traits.h:488)
| ->05.58% (9,216B) 0x402D4B: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::thread::_Impl<std::_Bind_simple<void (*())()> >*, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr_base.h:616)
| ->05.58% (9,216B) 0x402BDE: std::__shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr_base.h:1090)
| ->05.58% (9,216B) 0x402A76: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > >::shared_ptr<std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr.h:316)
| ->05.58% (9,216B) 0x402771: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::allocate_shared<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr.h:594)
| ->05.58% (9,216B) 0x402325: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::make_shared<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::_Bind_simple<void (*())()> >(std::_Bind_simple<void (*())()>&&) (shared_ptr.h:610)
| ->05.58% (9,216B) 0x401F9C: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::thread::_M_make_routine<std::_Bind_simple<void (*())()> >(std::_Bind_simple<void (*())()>&&) (thread:196)
| ->05.58% (9,216B) 0x401BC4: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
| ->05.58% (9,216B) 0x401353: main (test.cpp:24)
|
->01.24% (2,048B) 0x402C9A: __gnu_cxx::new_allocator<std::thread>::allocate(unsigned long, void const*) (new_allocator.h:104)
->01.24% (2,048B) 0x402AF5: std::allocator_traits<std::allocator<std::thread> >::allocate(std::allocator<std::thread>&, unsigned long) (alloc_traits.h:488)
->01.24% (2,048B) 0x402928: std::_Vector_base<std::thread, std::allocator<std::thread> >::_M_allocate(unsigned long) (stl_vector.h:170)
->01.24% (2,048B) 0x40244E: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<std::thread>(std::thread&&) (vector.tcc:412)
->01.24% (2,048B) 0x40206D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<std::thread>(std::thread&&) (vector.tcc:101)
->01.24% (2,048B) 0x401C82: std::vector<std::thread, std::allocator<std::thread> >::push_back(std::thread&&) (stl_vector.h:932)
->01.24% (2,048B) 0x401366: main (test.cpp:24)
请忽略线程的蹩脚处理,这只是一个非常简短的例子。
更新
看来,这与 std::string
完全无关。正如@Lawrence 所建议的,这可以通过简单地在堆上分配一个 int
(使用 new
)来重现。我相信 @Lawrence 非常接近这里的真实答案,引用他的评论(对更多读者来说更容易):
劳伦斯:
@KirilKirov The string allocation is not actually taking that much space... Each thread gets it's initial stack and then heap access maps some large amount of space (around 70m) that gets inaccurately reflected. You can measure it by just declaring 1 string and then having a spin loop... the same virtual memory usage is shown – Lawrence Sep 28 at 14:51
我:
@Lawrence - you're damn right! OK, so, you're saying (and it appears to be like this), that on each thread, on the first heap allocation, the memory manager (or the OS, or whatever) dedicates huge chunk of memory for the threads' heap needs? And this chunk will be reused later (or shrinked, if necessary)? – Kiril Kirov Sep 28 at 15:45
劳伦斯:
@KirilKirov something of that nature... exact allocations probably depends on malloc implementation and whatnot – Lawrence 2 days ago
这只是一个 "kind of" 答案(从 Valgrind 的角度来看)。内存池的问题,尤其是 C++ 字符串,已经为人所知有一段时间了。 Valgrind manual 有一个关于 C++ 字符串泄漏的部分,建议您尝试设置 GLIBCXX_FORCE_NEW 环境变量。
此外,对于 GCC6 及更高版本,Valgrind 添加了挂钩以清理 libstdc++ 中仍可访问的内存。 Valgrind bugzilla 条目是 here and the GCC one is here.
我不明白为什么这么小的分配会爆炸到这么多 GB(64 位可执行文件、CentOS 6.6、GCC 6.2 超过 12 GB)。
massif
和 --pages-as-heap=yes
以及您正在观察的 top
列都测量进程使用的虚拟内存。此虚拟内存包括所有 space mmap
'd 在执行 malloc 和创建线程期间。例如,线程的默认堆栈大小将为 8192k
,这反映在每个线程的创建中并有助于虚拟内存占用。
具体的分配方案将取决于实现,但似乎新线程上的第一个堆分配将 mmap
大约 65 兆字节 space。这可以通过查看进程的 pmap 输出来查看。
摘自与示例非常相似的程序:
75170: ./a.out
0000000000400000 24K r-x-- a.out
0000000000605000 4K r---- a.out
0000000000606000 4K rw--- a.out
0000000001b6a000 200K rw--- [ anon ]
00007f669dfa4000 4K ----- [ anon ]
00007f669dfa5000 8192K rw--- [ anon ]
00007f669e7a5000 4K ----- [ anon ]
00007f669e7a6000 8192K rw--- [ anon ]
00007f669efa6000 4K ----- [ anon ]
00007f669efa7000 8192K rw--- [ anon ]
...
00007f66cb800000 8192K rw--- [ anon ]
00007f66cc000000 132K rw--- [ anon ]
00007f66cc021000 65404K ----- [ anon ]
00007f66d0000000 132K rw--- [ anon ]
00007f66d0021000 65404K ----- [ anon ]
00007f66d4000000 132K rw--- [ anon ]
00007f66d4021000 65404K ----- [ anon ]
...
00007f6880586000 8192K rw--- [ anon ]
00007f6880d86000 1056K r-x-- libm-2.23.so
00007f6880e8e000 2044K ----- libm-2.23.so
...
00007f6881c08000 4K r---- libpthread-2.23.so
00007f6881c09000 4K rw--- libpthread-2.23.so
00007f6881c0a000 16K rw--- [ anon ]
00007f6881c0e000 152K r-x-- ld-2.23.so
00007f6881e09000 24K rw--- [ anon ]
00007f6881e33000 4K r---- ld-2.23.so
00007f6881e34000 4K rw--- ld-2.23.so
00007f6881e35000 4K rw--- [ anon ]
00007ffe9d75b000 132K rw--- [ stack ]
00007ffe9d7f8000 12K r---- [ anon ]
00007ffe9d7fb000 8K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 7815008K
当您接近每个进程的虚拟内存阈值时,malloc 似乎变得更加保守。另外,我关于单独映射库的评论被误导了(它们应该按进程共享)
查看文档:
--pages-as-heap= [default: no] Tells Massif to profile memory at the page level rather than at the malloc'd block level. See above for details.
因此,根据文档,更改此设置会改变测量的内容;不是分配的。
如果是,则您正在测量页数。 如果否,您正在测量 malloc 块。
我会试着写一个简短的总结我学到的东西,同时试图弄清楚发生了什么。
注意:感谢@Lawrence,这个答案是可能的 - 感谢!
长话短说
这与 Linux/kernel(虚拟)内存管理完全无关,也与 std::string
.
无关
这都是关于 glibc
的内存分配器 - 它只是在第一次(当然不仅仅是)动态分配(每个线程)时分配大量内存区域。
详情
#include <thread>
#include <vector>
#include <chrono>
int main() {
std::vector<std::thread> workers;
for( unsigned i = 0; i < 192; ++i )
workers.emplace_back([]{
const auto x = std::make_unique<int>(rand());
while (true) std::this_thread::sleep_for(std::chrono::seconds(1));});
workers.back().join();
}
请忽略线程的蹩脚处理,我希望它尽可能短。
命令
编译:g++ --std=c++14 -fno-inline -g3 -O0 -pthread test.cpp
.
简介:valgrind --tool=massif --pages-as-heap=[no|yes] ./a.out
内存使用情况
top
显示 7'815'012
KiB 虚拟内存。
pmap
还显示 7'815'016
KiB 虚拟内存。
massif
和 pages-as-heap=yes
显示了类似的结果:7'817'088
KiB,见下文。
另一方面,massif
与 pages-as-heap=no
截然不同——大约 133 KiB!
Massif 输出 pages-as-heap=yes
杀死程序前的内存使用情况:
100.00% (8,004,698,112B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.78% (7,986,741,248B) 0x54E0679: mmap (mmap.c:34)
| ->46.11% (3,690,987,520B) 0x545C3CF: new_heap (arena.c:438)
| | ->46.11% (3,690,987,520B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->46.11% (3,690,987,520B) 0x5463248: malloc (malloc.c:2911)
| | ->46.11% (3,690,987,520B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->46.11% (3,690,987,520B) 0x4026D0: std::_MakeUniq<int>::__single_object std::make_unique<int, int>(int&&) (unique_ptr.h:765)
| | ->46.11% (3,690,987,520B) 0x400EC5: main::{lambda()
| | ->46.11% (3,690,987,520B) 0x40225C: void std::_Bind_simple<main::{lambda()
| | ->46.11% (3,690,987,520B) 0x402194: std::_Bind_simple<main::{lambda()
| | ->46.11% (3,690,987,520B) 0x402102: std::thread::_Impl<std::_Bind_simple<main::{lambda()
| | ->46.11% (3,690,987,520B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->46.11% (3,690,987,520B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->46.11% (3,690,987,520B) 0x54E63DB: clone (clone.S:109)
| |
| ->33.53% (2,684,354,560B) 0x545C35B: new_heap (arena.c:427)
| | ->33.53% (2,684,354,560B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->33.53% (2,684,354,560B) 0x5463248: malloc (malloc.c:2911)
| | ->33.53% (2,684,354,560B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->33.53% (2,684,354,560B) 0x4026D0: std::_MakeUniq<int>::__single_object std::make_unique<int, int>(int&&) (unique_ptr.h:765)
| | ->33.53% (2,684,354,560B) 0x400EC5: main::{lambda()
| | ->33.53% (2,684,354,560B) 0x40225C: void std::_Bind_simple<main::{lambda()
| | ->33.53% (2,684,354,560B) 0x402194: std::_Bind_simple<main::{lambda()
| | ->33.53% (2,684,354,560B) 0x402102: std::thread::_Impl<std::_Bind_simple<main::{lambda()
| | ->33.53% (2,684,354,560B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->33.53% (2,684,354,560B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->33.53% (2,684,354,560B) 0x54E63DB: clone (clone.S:109)
| |
| ->20.13% (1,611,399,168B) 0x51CA1D4: pthread_create@@GLIBC_2.2.5 (allocatestack.c:513)
| ->20.13% (1,611,399,168B) 0x4CE2DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->20.13% (1,611,399,168B) 0x4CE2ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->20.13% (1,611,399,168B) 0x40139A: std::thread::thread<main::{lambda()
| ->20.13% (1,611,399,168B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
| ->20.13% (1,611,399,168B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
| ->19.19% (1,535,864,832B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| | ->19.19% (1,535,864,832B) 0x400F47: main (test.cpp:10)
| |
| ->00.94% (75,534,336B) in 1+ places, all below ms_print's threshold (01.00%)
|
->00.22% (17,956,864B) in 1+ places, all below ms_print's threshold (01.00%)
Massif 输出 pages-as-heap=no
杀死程序前的内存使用情况:
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
68 2,793,125 143,280 136,676 6,604 0
95.39% (136,676B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->50.74% (72,704B) 0x4EBAEFE: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->50.74% (72,704B) 0x40106B8: call_init.part.0 (dl-init.c:72)
| ->50.74% (72,704B) 0x40107C9: _dl_init (dl-init.c:30)
| ->50.74% (72,704B) 0x4000C68: ??? (in /lib/x86_64-linux-gnu/ld-2.23.so)
|
->36.58% (52,416B) 0x40138A3: _dl_allocate_tls (dl-tls.c:322)
| ->36.58% (52,416B) 0x53D126D: pthread_create@@GLIBC_2.2.5 (allocatestack.c:588)
| ->36.58% (52,416B) 0x4EE9DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->36.58% (52,416B) 0x4EE9ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->36.58% (52,416B) 0x40139A: std::thread::thread<main::{lambda()
| ->36.58% (52,416B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
| ->36.58% (52,416B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
| ->34.77% (49,824B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| | ->34.77% (49,824B) 0x400F47: main (test.cpp:10)
| |
| ->01.81% (2,592B) 0x4010FF: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<main::{lambda()
| ->01.81% (2,592B) 0x40103D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| ->01.81% (2,592B) 0x400F47: main (test.cpp:10)
|
->06.13% (8,784B) 0x401B4B: __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x401A60: std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x40194D: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x401894: std::__shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x40183A: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x4017C7: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x4016AB: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x40155E: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x401374: std::thread::thread<main::{lambda()
| ->06.13% (8,784B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
| ->06.13% (8,784B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
| ->05.83% (8,352B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| | ->05.83% (8,352B) 0x400F47: main (test.cpp:10)
| |
| ->00.30% (432B) in 1+ places, all below ms_print's threshold (01.00%)
|
->01.43% (2,048B) 0x403432: __gnu_cxx::new_allocator<std::thread>::allocate(unsigned long, void const*) (new_allocator.h:104)
| ->01.43% (2,048B) 0x4032CF: std::allocator_traits<std::allocator<std::thread> >::allocate(std::allocator<std::thread>&, unsigned long) (alloc_traits.h:488)
| ->01.43% (2,048B) 0x4030B8: std::_Vector_base<std::thread, std::allocator<std::thread> >::_M_allocate(unsigned long) (stl_vector.h:170)
| ->01.43% (2,048B) 0x4010B6: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<main::{lambda()
| ->01.43% (2,048B) 0x40103D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
| ->01.43% (2,048B) 0x400F47: main (test.cpp:10)
|
->00.51% (724B) in 1+ places, all below ms_print's threshold (01.00%)
发生了什么怪事?
pages-as-heap=no
在 pages-as-heap=no
中,事情看起来很合理 - 让我们不检查它。正如预期的那样,一切都以 malloc/new/new[]
结束,内存使用量很小,我们不用担心 - 这些是高级分配。
pages-as-heap=yes
但看到pages-as-heap=yes
? ~8GiB 虚拟内存用这个简单的代码?
让我们检查堆栈跟踪。
pthread_create
让我们从更简单的开始:以 pthread_create
结尾的那个。
massif
报告分配的内存 1,611,399,168
字节 - 这正好是 192 * 8'196 KiB,意思是 - 192 个线程 * 8MiB,which is the default max stack size of a thread in Linux。
注意,8'196 KiB 不完全是 8 MiB (8'192 KiB)。我不知道这种差异从何而来,但目前并不显着。
std::make_unique<int>
好的,现在让我们看看另外两个堆栈...等等,它们完全一样?是的,massif
的文档解释了这个,我没有完全理解它,但它也不重要。它们显示完全相同的堆栈。让我们结合结果一起检验。
这两个堆栈的内存使用量加起来是 6'375'342'080
字节,所有这些都是由我们简单的 std::make_unique<int>
!
让我们退后一步:如果我们 运行 同样的实验,但使用一个简单的线程,我们将看到,这个 int
分配导致分配 67'108'864
字节的内存,正好是 64 MB。会发生什么??
这一切都归结为 malloc
的实现(众所周知,new/new[]
在内部默认使用 malloc
.. 实现)。
malloc
内部使用内存分配器,称为 ptmalloc2
- Linux 中的默认内存分配器,它支持线程。
简单地说,这个分配器处理以下条款:
per thread arena
:巨大的内存区域;出于性能原因,通常是每个线程;并非所有软件线程都有自己的 per-thread-arenas,这通常取决于硬件线程的数量(我猜还有其他细节);heap
:arena
分堆;chunks
:heap
被分成更小的内存区域,称为chunks
。
关于这些东西有很多细节,稍后会 post 一些有趣的链接,虽然这应该足以让 reader 自己研究 - 这些真的很低-层次和深层次的东西,跟C++内存管理有关。
那么,让我们回到我们的单线程测试 - 为单个 int
分配 64 MiB?让我们再次查看堆栈跟踪并集中在它的末尾:
mmap (mmap.c:34)
new_heap (arena.c:438)
arena_get2.part.3 (arena.c:646)
malloc (malloc.c:2911)
惊喜,惊喜:malloc
调用arena_get2
,调用new_heap
,这导致我们mmap
(mmap
和brk
是低级系统调用,用于 Linux 中的内存分配)。据报道,这恰好分配了 64 MiB 内存。
好的,现在让我们回到我们最初的例子,有 192 个线程和我们巨大的数字 6'375'342'080
- 这 正好 95 * 64 MiB!
为什么是 95 - 我真的不能说,我停止挖掘了,但事实上,这个大数字可以整除 64 MiB 对我来说已经足够了。
如有必要,您可以深入挖掘。
有用的链接
非常棒的解释性文章:Understanding glibc malloc,作者:sploitfun
更多 formal/official 文档:The GNU allocator
一个很酷的堆栈交换问题:How does glibc malloc works
其他:
- https://zgqallen.github.io/2017/08/03/linux-glic-mm-overview/
- https://en.wikipedia.org/wiki/C_dynamic_memory_allocation
- https://www.blackhat.com/presentations/bh-usa-07/Ferguson/Whitepaper/bh-usa-07-ferguson-WP.pdf
- heap management
如果在阅读本文时其中一些链接已损坏 post,则应该很容易找到类似的文章。这个话题很受欢迎,如果你知道要寻找什么以及如何寻找。
谢谢
我希望这些观察能够对整个画面进行良好的高级描述,并为进一步的扩展研究提供足够的食物。
随时评论/(建议)编辑/更正/扩展/等