了解 Linux 虚拟内存:valgrind 的 massif 输出显示有和没有 --pages-as-heap 的主要差异

Understanding Linux virtual memory: valgrind's massif output shows major differences with and without --pages-as-heap

我看过这个参数的文档,但是差别真的很大!启用时,一个简单程序(见下文)的内存使用量约为 7 GB,禁用时,报告的使用量约为 160 KB .

top 也显示了大约 7 GB,这有点 证实了 pages-as-heap=yes 的结果。

(我有一个理论,但我不相信它能解释如此巨大的差异,所以 - 寻求帮助)。

让我特别困扰的是,大部分报告的内存使用量都被 std::string 使用,而 what? 从未被打印出来(意思是 - 实际容量非常小)。

我确实需要在分析我的应用程序时使用 pages-as-heap=yes,我只是想知道如何避免 "false positives"


代码片段:

#include <iostream>
#include <thread>
#include <vector>
#include <chrono>

void run()
{
    while (true)
    {
        std::string s;
        s += "aaaaa";
        s += "aaaaaaaaaaaaaaa";
        s += "bbbbbbbbbb";
        s += "cccccccccccccccccccccccccccccccccccccccccccccccccc";
        if (s.capacity() > 1024) std::cout << "what?" << std::endl;

        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}

int main()
{
    std::vector<std::thread> workers;
    for( unsigned i = 0; i < 192; ++i ) workers.push_back(std::thread(&run));

    workers.back().join();
}

编译:g++ --std=c++11 -fno-inline -g3 -pthread

pages-as-heap=yes:

100.00% (7,257,714,688B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.75% (7,239,757,824B) 0x54E0679: mmap (mmap.c:34)
| ->53.63% (3,892,314,112B) 0x545C3CF: new_heap (arena.c:438)
| | ->53.63% (3,892,314,112B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| |   ->53.63% (3,892,314,112B) 0x5463248: malloc (malloc.c:2911)
| |     ->53.63% (3,892,314,112B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |       ->53.63% (3,892,314,112B) 0x4CF8E37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |         ->53.63% (3,892,314,112B) 0x4CF9C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |           ->53.63% (3,892,314,112B) 0x4CF9D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |             ->53.63% (3,892,314,112B) 0x4CF9FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |               ->53.63% (3,892,314,112B) 0x401252: run() (test.cpp:11)
| |                 ->53.63% (3,892,314,112B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| |                   ->53.63% (3,892,314,112B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| |                     ->53.63% (3,892,314,112B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| |                       ->53.63% (3,892,314,112B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |                         ->53.63% (3,892,314,112B) 0x51C96B8: start_thread (pthread_create.c:333)
| |                           ->53.63% (3,892,314,112B) 0x54E63DB: clone (clone.S:109)
| |                             
| ->35.14% (2,550,136,832B) 0x545C35B: new_heap (arena.c:427)
| | ->35.14% (2,550,136,832B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| |   ->35.14% (2,550,136,832B) 0x5463248: malloc (malloc.c:2911)
| |     ->35.14% (2,550,136,832B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |       ->35.14% (2,550,136,832B) 0x4CF8E37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |         ->35.14% (2,550,136,832B) 0x4CF9C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |           ->35.14% (2,550,136,832B) 0x4CF9D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |             ->35.14% (2,550,136,832B) 0x4CF9FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |               ->35.14% (2,550,136,832B) 0x401252: run() (test.cpp:11)
| |                 ->35.14% (2,550,136,832B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| |                   ->35.14% (2,550,136,832B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| |                     ->35.14% (2,550,136,832B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| |                       ->35.14% (2,550,136,832B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |                         ->35.14% (2,550,136,832B) 0x51C96B8: start_thread (pthread_create.c:333)
| |                           ->35.14% (2,550,136,832B) 0x54E63DB: clone (clone.S:109)
| |                             
| ->10.99% (797,306,880B) 0x51CA1D4: pthread_create@@GLIBC_2.2.5 (allocatestack.c:513)
|   ->10.99% (797,306,880B) 0x4CE2DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|     ->10.99% (797,306,880B) 0x4CE2ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|       ->10.99% (797,306,880B) 0x401BEA: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
|         ->10.99% (797,306,880B) 0x401353: main (test.cpp:24)
|           
->00.25% (17,956,864B) in 1+ places, all below ms_print's threshold (01.00%)

同时 pages-as-heap=no:

96.38% (159,289B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->43.99% (72,704B) 0x4EBAEFE: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->43.99% (72,704B) 0x40106B8: call_init.part.0 (dl-init.c:72)
|   ->43.99% (72,704B) 0x40107C9: _dl_init (dl-init.c:30)
|     ->43.99% (72,704B) 0x4000C68: ??? (in /lib/x86_64-linux-gnu/ld-2.23.so)
|       
->33.46% (55,296B) 0x40138A3: _dl_allocate_tls (dl-tls.c:322)
| ->33.46% (55,296B) 0x53D126D: pthread_create@@GLIBC_2.2.5 (allocatestack.c:588)
|   ->33.46% (55,296B) 0x4EE9DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|     ->33.46% (55,296B) 0x4EE9ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|       ->33.46% (55,296B) 0x401BEA: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
|         ->33.46% (55,296B) 0x401353: main (test.cpp:24)
|           
->12.12% (20,025B) 0x4EFFE37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.12% (20,025B) 0x4F00C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|   ->12.12% (20,025B) 0x4F00D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|     ->12.12% (20,025B) 0x4F00FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|       ->12.07% (19,950B) 0x401285: run() (test.cpp:14)
|       | ->12.07% (19,950B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
|       |   ->12.07% (19,950B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
|       |     ->12.07% (19,950B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
|       |       ->12.07% (19,950B) 0x4EE9C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|       |         ->12.07% (19,950B) 0x53D06B8: start_thread (pthread_create.c:333)
|       |           ->12.07% (19,950B) 0x56ED3DB: clone (clone.S:109)
|       |             
|       ->00.05% (75B) in 1+ places, all below ms_print's threshold (01.00%)
|       
->05.58% (9,216B) 0x40315B: __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) (new_allocator.h:104)
| ->05.58% (9,216B) 0x402FC2: std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> > >::allocate(std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> >&, unsigned long) (alloc_traits.h:488)
|   ->05.58% (9,216B) 0x402D4B: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::thread::_Impl<std::_Bind_simple<void (*())()> >*, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr_base.h:616)
|     ->05.58% (9,216B) 0x402BDE: std::__shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr_base.h:1090)
|       ->05.58% (9,216B) 0x402A76: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > >::shared_ptr<std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr.h:316)
|         ->05.58% (9,216B) 0x402771: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::allocate_shared<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr.h:594)
|           ->05.58% (9,216B) 0x402325: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::make_shared<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::_Bind_simple<void (*())()> >(std::_Bind_simple<void (*())()>&&) (shared_ptr.h:610)
|             ->05.58% (9,216B) 0x401F9C: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::thread::_M_make_routine<std::_Bind_simple<void (*())()> >(std::_Bind_simple<void (*())()>&&) (thread:196)
|               ->05.58% (9,216B) 0x401BC4: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
|                 ->05.58% (9,216B) 0x401353: main (test.cpp:24)
|                   
->01.24% (2,048B) 0x402C9A: __gnu_cxx::new_allocator<std::thread>::allocate(unsigned long, void const*) (new_allocator.h:104)
  ->01.24% (2,048B) 0x402AF5: std::allocator_traits<std::allocator<std::thread> >::allocate(std::allocator<std::thread>&, unsigned long) (alloc_traits.h:488)
    ->01.24% (2,048B) 0x402928: std::_Vector_base<std::thread, std::allocator<std::thread> >::_M_allocate(unsigned long) (stl_vector.h:170)
      ->01.24% (2,048B) 0x40244E: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<std::thread>(std::thread&&) (vector.tcc:412)
        ->01.24% (2,048B) 0x40206D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<std::thread>(std::thread&&) (vector.tcc:101)
          ->01.24% (2,048B) 0x401C82: std::vector<std::thread, std::allocator<std::thread> >::push_back(std::thread&&) (stl_vector.h:932)
            ->01.24% (2,048B) 0x401366: main (test.cpp:24)

请忽略线程的蹩脚处理,这只是一个非常简短的例子。


更新

看来,这与 std::string 完全无关。正如@Lawrence 所建议的,这可以通过简单地在堆上分配一个 int(使用 new)来重现。我相信 @Lawrence 非常接近这里的真实答案,引用他的评论(对更多读者来说更容易):

劳伦斯:

@KirilKirov The string allocation is not actually taking that much space... Each thread gets it's initial stack and then heap access maps some large amount of space (around 70m) that gets inaccurately reflected. You can measure it by just declaring 1 string and then having a spin loop... the same virtual memory usage is shown – Lawrence Sep 28 at 14:51

我:

@Lawrence - you're damn right! OK, so, you're saying (and it appears to be like this), that on each thread, on the first heap allocation, the memory manager (or the OS, or whatever) dedicates huge chunk of memory for the threads' heap needs? And this chunk will be reused later (or shrinked, if necessary)? – Kiril Kirov Sep 28 at 15:45

劳伦斯:

@KirilKirov something of that nature... exact allocations probably depends on malloc implementation and whatnot – Lawrence 2 days ago

这只是一个 "kind of" 答案(从 Valgrind 的角度来看)。内存池的问题,尤其是 C++ 字符串,已经为人所知有一段时间了。 Valgrind manual 有一个关于 C++ 字符串泄漏的部分,建议您尝试设置 GLIBCXX_FORCE_NEW 环境变量。

此外,对于 GCC6 及更高版本,Valgrind 添加了挂钩以清理 libstdc++ 中仍可访问的内存。 Valgrind bugzilla 条目是 here and the GCC one is here.

我不明白为什么这么小的分配会爆炸到这么多 GB(64 位可执行文件、CentOS 6.6、GCC 6.2 超过 12 GB)。

massif--pages-as-heap=yes 以及您正在观察的 top 列都测量进程使用的虚拟内存。此虚拟内存包括所有 space mmap'd 在执行 malloc 和创建线程期间。例如,线程的默认堆栈大小将为 8192k,这反映在每个线程的创建中并有助于虚拟内存占用。

具体的分配方案将取决于实现,但似乎新线程上的第一个堆分配将 mmap 大约 65 兆字节 space。这可以通过查看进程的 pmap 输出来查看。

摘自与示例非常相似的程序:

75170:   ./a.out
0000000000400000     24K r-x-- a.out
0000000000605000      4K r---- a.out
0000000000606000      4K rw--- a.out
0000000001b6a000    200K rw---   [ anon ]
00007f669dfa4000      4K -----   [ anon ]
00007f669dfa5000   8192K rw---   [ anon ]
00007f669e7a5000      4K -----   [ anon ]
00007f669e7a6000   8192K rw---   [ anon ]
00007f669efa6000      4K -----   [ anon ]
00007f669efa7000   8192K rw---   [ anon ]
...
00007f66cb800000   8192K rw---   [ anon ]
00007f66cc000000    132K rw---   [ anon ]
00007f66cc021000  65404K -----   [ anon ]
00007f66d0000000    132K rw---   [ anon ]
00007f66d0021000  65404K -----   [ anon ]
00007f66d4000000    132K rw---   [ anon ]
00007f66d4021000  65404K -----   [ anon ]
...
00007f6880586000   8192K rw---   [ anon ]
00007f6880d86000   1056K r-x-- libm-2.23.so
00007f6880e8e000   2044K ----- libm-2.23.so
...
00007f6881c08000      4K r---- libpthread-2.23.so
00007f6881c09000      4K rw--- libpthread-2.23.so
00007f6881c0a000     16K rw---   [ anon ]
00007f6881c0e000    152K r-x-- ld-2.23.so
00007f6881e09000     24K rw---   [ anon ]
00007f6881e33000      4K r---- ld-2.23.so
00007f6881e34000      4K rw--- ld-2.23.so
00007f6881e35000      4K rw---   [ anon ]
00007ffe9d75b000    132K rw---   [ stack ]
00007ffe9d7f8000     12K r----   [ anon ]
00007ffe9d7fb000      8K r-x--   [ anon ]
ffffffffff600000      4K r-x--   [ anon ]
 total          7815008K

当您接近每个进程的虚拟内存阈值时,malloc 似乎变得更加保守。另外,我关于单独映射库的评论被误导了(它们应该按进程共享)

查看文档:

--pages-as-heap= [default: no] Tells Massif to profile memory at the page level rather than at the malloc'd block level. See above for details.

因此,根据文档,更改此设置会改变测量的内容;不是分配的。

如果是,则您正在测量页数。 如果否,您正在测量 malloc 块。

我会试着写一个简短的总结我学到的东西,同时试图弄清楚发生了什么。
注意:感谢@Lawrence,这个答案是可能的 - 感谢!


长话短说

这与 Linux/kernel(虚拟)内存管理完全无关,也与 std::string.
无关 这都是关于 glibc 的内存分配器 - 它只是在第一次(当然不仅仅是)动态分配(每个线程)时分配大量内存区域


详情

MCVE

#include <thread>
#include <vector>
#include <chrono>

int main() {
    std::vector<std::thread> workers;
    for( unsigned i = 0; i < 192; ++i )
        workers.emplace_back([]{
            const auto x = std::make_unique<int>(rand());
            while (true) std::this_thread::sleep_for(std::chrono::seconds(1));});
    workers.back().join();
}

请忽略线程的蹩脚处理,我希望它尽可能短。

命令

编译:g++ --std=c++14 -fno-inline -g3 -O0 -pthread test.cpp.
简介:valgrind --tool=massif --pages-as-heap=[no|yes] ./a.out

内存使用情况

top 显示 7'815'012 KiB 虚拟内存。
pmap 还显示 7'815'016 KiB 虚拟内存。
massifpages-as-heap=yes 显示了类似的结果:7'817'088 KiB,见下文。
另一方面,massifpages-as-heap=no 截然不同——大约 133 KiB!

Massif 输出 pages-as-heap=yes

杀死程序前的内存使用情况:

100.00% (8,004,698,112B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.78% (7,986,741,248B) 0x54E0679: mmap (mmap.c:34)
| ->46.11% (3,690,987,520B) 0x545C3CF: new_heap (arena.c:438)
| | ->46.11% (3,690,987,520B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| |   ->46.11% (3,690,987,520B) 0x5463248: malloc (malloc.c:2911)
| |     ->46.11% (3,690,987,520B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |       ->46.11% (3,690,987,520B) 0x4026D0: std::_MakeUniq<int>::__single_object std::make_unique<int, int>(int&&) (unique_ptr.h:765)
| |         ->46.11% (3,690,987,520B) 0x400EC5: main::{lambda()
| |           ->46.11% (3,690,987,520B) 0x40225C: void std::_Bind_simple<main::{lambda()
| |             ->46.11% (3,690,987,520B) 0x402194: std::_Bind_simple<main::{lambda()
| |               ->46.11% (3,690,987,520B) 0x402102: std::thread::_Impl<std::_Bind_simple<main::{lambda()
| |                 ->46.11% (3,690,987,520B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |                   ->46.11% (3,690,987,520B) 0x51C96B8: start_thread (pthread_create.c:333)
| |                     ->46.11% (3,690,987,520B) 0x54E63DB: clone (clone.S:109)
| |                       
| ->33.53% (2,684,354,560B) 0x545C35B: new_heap (arena.c:427)
| | ->33.53% (2,684,354,560B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| |   ->33.53% (2,684,354,560B) 0x5463248: malloc (malloc.c:2911)
| |     ->33.53% (2,684,354,560B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |       ->33.53% (2,684,354,560B) 0x4026D0: std::_MakeUniq<int>::__single_object std::make_unique<int, int>(int&&) (unique_ptr.h:765)
| |         ->33.53% (2,684,354,560B) 0x400EC5: main::{lambda()
| |           ->33.53% (2,684,354,560B) 0x40225C: void std::_Bind_simple<main::{lambda()
| |             ->33.53% (2,684,354,560B) 0x402194: std::_Bind_simple<main::{lambda()
| |               ->33.53% (2,684,354,560B) 0x402102: std::thread::_Impl<std::_Bind_simple<main::{lambda()
| |                 ->33.53% (2,684,354,560B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |                   ->33.53% (2,684,354,560B) 0x51C96B8: start_thread (pthread_create.c:333)
| |                     ->33.53% (2,684,354,560B) 0x54E63DB: clone (clone.S:109)
| |                       
| ->20.13% (1,611,399,168B) 0x51CA1D4: pthread_create@@GLIBC_2.2.5 (allocatestack.c:513)
|   ->20.13% (1,611,399,168B) 0x4CE2DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|     ->20.13% (1,611,399,168B) 0x4CE2ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|       ->20.13% (1,611,399,168B) 0x40139A: std::thread::thread<main::{lambda()
|         ->20.13% (1,611,399,168B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
|           ->20.13% (1,611,399,168B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
|             ->19.19% (1,535,864,832B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
|             | ->19.19% (1,535,864,832B) 0x400F47: main (test.cpp:10)
|             |   
|             ->00.94% (75,534,336B) in 1+ places, all below ms_print's threshold (01.00%)
|             
->00.22% (17,956,864B) in 1+ places, all below ms_print's threshold (01.00%)

Massif 输出 pages-as-heap=no

杀死程序前的内存使用情况:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 68      2,793,125          143,280          136,676         6,604            0
95.39% (136,676B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->50.74% (72,704B) 0x4EBAEFE: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->50.74% (72,704B) 0x40106B8: call_init.part.0 (dl-init.c:72)
|   ->50.74% (72,704B) 0x40107C9: _dl_init (dl-init.c:30)
|     ->50.74% (72,704B) 0x4000C68: ??? (in /lib/x86_64-linux-gnu/ld-2.23.so)
|       
->36.58% (52,416B) 0x40138A3: _dl_allocate_tls (dl-tls.c:322)
| ->36.58% (52,416B) 0x53D126D: pthread_create@@GLIBC_2.2.5 (allocatestack.c:588)
|   ->36.58% (52,416B) 0x4EE9DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|     ->36.58% (52,416B) 0x4EE9ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|       ->36.58% (52,416B) 0x40139A: std::thread::thread<main::{lambda()
|         ->36.58% (52,416B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
|           ->36.58% (52,416B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
|             ->34.77% (49,824B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
|             | ->34.77% (49,824B) 0x400F47: main (test.cpp:10)
|             |   
|             ->01.81% (2,592B) 0x4010FF: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<main::{lambda()
|               ->01.81% (2,592B) 0x40103D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
|                 ->01.81% (2,592B) 0x400F47: main (test.cpp:10)
|                   
->06.13% (8,784B) 0x401B4B: __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x401A60: std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|   ->06.13% (8,784B) 0x40194D: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|     ->06.13% (8,784B) 0x401894: std::__shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|       ->06.13% (8,784B) 0x40183A: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|         ->06.13% (8,784B) 0x4017C7: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|           ->06.13% (8,784B) 0x4016AB: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|             ->06.13% (8,784B) 0x40155E: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|               ->06.13% (8,784B) 0x401374: std::thread::thread<main::{lambda()
|                 ->06.13% (8,784B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
|                   ->06.13% (8,784B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
|                     ->05.83% (8,352B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
|                     | ->05.83% (8,352B) 0x400F47: main (test.cpp:10)
|                     |   
|                     ->00.30% (432B) in 1+ places, all below ms_print's threshold (01.00%)
|                     
->01.43% (2,048B) 0x403432: __gnu_cxx::new_allocator<std::thread>::allocate(unsigned long, void const*) (new_allocator.h:104)
| ->01.43% (2,048B) 0x4032CF: std::allocator_traits<std::allocator<std::thread> >::allocate(std::allocator<std::thread>&, unsigned long) (alloc_traits.h:488)
|   ->01.43% (2,048B) 0x4030B8: std::_Vector_base<std::thread, std::allocator<std::thread> >::_M_allocate(unsigned long) (stl_vector.h:170)
|     ->01.43% (2,048B) 0x4010B6: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<main::{lambda()
|       ->01.43% (2,048B) 0x40103D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
|         ->01.43% (2,048B) 0x400F47: main (test.cpp:10)
|           
->00.51% (724B) in 1+ places, all below ms_print's threshold (01.00%)

发生了什么怪事?

pages-as-heap=no

pages-as-heap=no 中,事情看起来很合理 - 让我们不检查它。正如预期的那样,一切都以 malloc/new/new[] 结束,内存使用量很小,我们不用担心 - 这些是高级分配。

pages-as-heap=yes

但看到pages-as-heap=yes? ~8GiB 虚拟内存用这个简单的代码?

让我们检查堆栈跟踪。

pthread_create

让我们从更简单的开始:以 pthread_create 结尾的那个。

massif 报告分配的内存 1,611,399,168 字节 - 这正好是 192 * 8'196 KiB,意思是 - 192 个线程 * 8MiB,which is the default max stack size of a thread in Linux

注意,8'196 KiB 不完全是 8 MiB (8'192 KiB)。我不知道这种差异从何而来,但目前并不显着​​。

std::make_unique<int>

好的,现在让我们看看另外两个堆栈...等等,它们完全一样?是的,massif的文档解释了这个,我没有完全理解它,但它也不重要。它们显示完全相同的堆栈。让我们结合结果一起检验。

这两个堆栈的内存使用量加起来是 6'375'342'080 字节,所有这些都是由我们简单的 std::make_unique<int>!

引起的

让我们退后一步:如果我们 运行 同样的实验,但使用一个简单的线程,我们将看到,这个 int 分配导致分配 67'108'864 字节的内存,正好是 64 MB。会发生什么??

这一切都归结为 malloc 的实现(众所周知,new/new[] 在内部默认使用 malloc.. 实现)。

malloc 内部使用内存分配器,称为 ptmalloc2 - Linux 中的默认内存分配器,它支持线程。

简单地说,这个分​​配器处理以下条款:

  • per thread arena:巨大的内存区域;出于性能原因,通常是每个线程;并非所有软件线程都有自己的 per-thread-arenas,这通常取决于硬件线程的数量(我猜还有其他细节);
  • heaparena分堆;
  • chunksheap被分成更小的内存区域,称为chunks

关于这些东西有很多细节,稍后会 post 一些有趣的链接,虽然这应该足以让 reader 自己研究 - 这些真的很低-层次和深层次的东西,跟C++内存管理有关。

那么,让我们回到我们的单线程测试 - 为单个 int 分配 64 MiB?让我们再次查看堆栈跟踪并集中在它的末尾:

mmap (mmap.c:34)
new_heap (arena.c:438)
arena_get2.part.3 (arena.c:646)
malloc (malloc.c:2911)

惊喜,惊喜:malloc调用arena_get2,调用new_heap,这导致我们mmapmmapbrk是低级系统调用,用于 Linux 中的内存分配)。据报道,这恰好分配了 64 MiB 内存。

好的,现在让我们回到我们最初的例子,有 192 个线程和我们巨大的数字 6'375'342'080 - 这 正好 95 * 64 MiB!

为什么是 95 - 我真的不能说,我停止挖掘了,但事实上,这个大数字可以整除 64 MiB 对我来说已经足够了。

如有必要,您可以深入挖掘。

有用的链接

非常棒的解释性文章:Understanding glibc malloc,作者:sploitfun

更多 formal/official 文档:The GNU allocator

一个很酷的堆栈交换问题:How does glibc malloc works

其他:

如果在阅读本文时其中一些链接已损坏 post,则应该很容易找到类似的文章。这个话题很受欢迎,如果你知道要寻找什么以及如何寻找。

谢谢

我希望这些观察能够对整个画面进行良好的高级描述,并为进一步的扩展研究提供足够的食物。

随时评论/(建议)编辑/更正/扩展/等