boost::pool_allocator significantly slower than std::allocator

I am learning about memory pools and tried to use boost::pool_allocator in my project. Following the documentation, I wrote a small benchmark of the time cost:

#include <boost/pool/pool_alloc.hpp>
#include <chrono>
#include <iostream>
#include <vector>

template <typename Alloc>
void test()
{
    using namespace std::chrono;
    auto t0 = high_resolution_clock::now();
    for (int i = 0; i < 1000; ++i) {
        std::vector<int, Alloc> vec;
        for (int j = 0; j < 10000; ++j)
            vec.push_back(i + j);
    }
    auto t1 = high_resolution_clock::now();
    auto time_ms = duration<double>(t1 - t0).count() * 1e3;
    std::cout << "time cost: " << time_ms << " ms" << std::endl;
}

int main()
{
    test<std::allocator<int>>();
    test<boost::pool_allocator<int>>();
}

The result is:

time cost: 3.97602 ms
time cost: 91.3943 ms

The Boost documentation says:

Pools are generally used when there is a lot of allocation and deallocation of small objects.

So I expected boost::pool_allocator to take less time than std::allocator in the code above, but the test result shows it is far worse.

Am I using boost::pool_allocator incorrectly? In what situations can I actually get a speedup by using a memory pool (or just Boost pool/pool_allocator)?

Pools are generally used when there is a lot of allocation and deallocation of small objects.

A vector does not allocate "small objects". A vector<T> allocates arrays. More specifically, it allocates a single array in one contiguous block of memory big enough to hold (at least) sizeof(T) * size bytes.

Pool allocators are very bad at this allocation pattern.
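To make that concrete, here is a minimal sketch of my own (not part of the original post) that prints every reallocation one inner loop of the benchmark triggers. With the usual geometric growth you only see roughly 15 to 25 allocations, depending on your library's growth factor, and the later ones are tens of kilobytes each; nothing like "a lot of small objects":

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> vec;
    std::size_t last_cap = 0;
    for (int j = 0; j < 10000; ++j) {
        vec.push_back(j);
        if (vec.capacity() != last_cap) {   // a reallocation happened
            last_cap = vec.capacity();
            std::cout << "reallocated: " << last_cap * sizeof(int) << " bytes\n";
        }
    }
}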

For reference:

template <typename T,
    typename UserAllocator = default_user_allocator_new_delete,
    typename Mutex = details::pool::default_mutex,
    unsigned NextSize = 32,
    unsigned MaxSize = 0>
class pool_allocator;

I figured maybe the locking was the cause. Also, a better size hint (NextSize) could be given.

Let's test that! Live On Compiler Explorer

#include <boost/core/demangle.hpp>
#include <boost/pool/pool_alloc.hpp>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <vector>

using namespace std::chrono_literals;
auto static now = std::chrono::high_resolution_clock::now;

template <typename Alloc> void test(int run, Alloc alloc = {}) {
    auto load = [=](bool RESERVE, unsigned ITERATIONS = 1'000, unsigned SIZE = 10'000) {
        for (unsigned i = 0; i < ITERATIONS; ++i) {
            std::vector<int, Alloc> vec(alloc);
            if (RESERVE)
                vec.reserve(SIZE);
            for (unsigned j = 0; j < SIZE; ++j)
                vec.push_back(i + j);
        }
    };

    auto lap_time = [t0 = now()]() mutable {
        return now() - std::exchange(t0, now());
    };

    load(false); auto without_reserve = lap_time() / 1.0ms;
    load(true);  auto with_reserve    = lap_time() / 1.0ms;

    std::cout << "run " << run                                             //
              << " naive:    " << std::setw(7) << without_reserve << "ms"  //
              << " reserved: " << std::setw(7) << with_reserve    << "ms"  //
              << "(" << boost::core::demangle(typeid(Alloc).name()) << ")" //
              << std::endl;
}

void run_tests(int run) {
    test<std::allocator<int>>(run);

    using NullMx    = boost::details::pool::null_mutex;
    using Mx        = boost::details::pool::default_mutex;
    using Malloc    = boost::default_user_allocator_malloc_free;
    using NewDelete = boost::default_user_allocator_new_delete;

    // 
    // no hints
    //
    test<boost::pool_allocator<int, Malloc,    NullMx>>(run);
    test<boost::pool_allocator<int, NewDelete, NullMx>>(run);
    test<boost::pool_allocator<int, Malloc,    Mx>>(run);
    test<boost::pool_allocator<int, NewDelete, Mx>>(run);

    //
    // hinted
    //
    test<boost::pool_allocator<int, Malloc,    NullMx, 1'000, 0>>(run);
    test<boost::pool_allocator<int, NewDelete, NullMx, 1'000, 0>>(run);
    test<boost::pool_allocator<int, Malloc,    Mx,     1'000, 0>>(run);
    test<boost::pool_allocator<int, NewDelete, Mx,     1'000, 0>>(run);
}

int main()
{
    std::cout << std::fixed << std::setprecision(3);

    for (int run : {1,2,3}) {
        auto t0 = now();
        run_tests(run);
        std::cout << " -- Done (" << (now() - t0) / 1.ms << "ms)" << std::endl;
    }
}

Compiler Explorer is showing some really inconsistent spikes; my own machine doesn't:

run 1 naive:      8.025ms reserved:   5.412ms(std::allocator<int>)
run 1 naive:     92.212ms reserved:  31.166ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
run 1 naive:     93.466ms reserved:  29.901ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
run 1 naive:     92.488ms reserved:  29.883ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
run 1 naive:     92.450ms reserved:  29.824ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
run 1 naive:     82.879ms reserved:  27.478ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
run 1 naive:     82.775ms reserved:  28.187ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
run 1 naive:     83.189ms reserved:  27.404ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
run 1 naive:     83.159ms reserved:  27.468ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
 -- Done (947.595ms)
run 2 naive:      8.007ms reserved:   5.543ms(std::allocator<int>)
run 2 naive:     92.225ms reserved:  29.882ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
run 2 naive:     92.311ms reserved:  29.805ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
run 2 naive:     92.601ms reserved:  29.873ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
run 2 naive:     92.421ms reserved:  30.028ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
run 2 naive:     83.028ms reserved:  27.493ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
run 2 naive:     82.822ms reserved:  27.427ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
run 2 naive:     83.230ms reserved:  27.493ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
run 2 naive:     83.104ms reserved:  27.466ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
 -- Done (944.958ms)
run 3 naive:      8.068ms reserved:   5.422ms(std::allocator<int>)
run 3 naive:     92.282ms reserved:  29.880ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
run 3 naive:     92.064ms reserved:  29.960ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
run 3 naive:     92.339ms reserved:  29.928ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
run 3 naive:     92.977ms reserved:  29.890ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
run 3 naive:     82.906ms reserved:  27.388ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
run 3 naive:     82.784ms reserved:  27.585ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
run 3 naive:     83.157ms reserved:  28.233ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
run 3 naive:     83.098ms reserved:  27.466ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
 -- Done (945.629ms)

Still, it is clearly always slower. Let's bring in the

Profiler

Comparing just the standard allocator against boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>:

  • The standard allocator results in 48k calls to new/delete, causing just as many underlying calls to malloc/free (the counting-allocator sketch after this list shows one way to reproduce counts of that magnitude without a profiler)

  • The pool allocator shows a vastly reduced number:

    and likewise for malloc/free:
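If you do not have a profiler at hand, a tiny counting allocator is enough to reproduce the order of magnitude of those figures. This is my own sketch, not the instrumentation used above; counting_allocator and the g_allocs/g_frees counters are names I made up, and the exact counts depend on your standard library's growth factor:

#include <cstddef>
#include <iostream>
#include <vector>

static std::size_t g_allocs = 0, g_frees = 0;

// Minimal allocator that forwards to operator new/delete and tallies the calls
template <typename T>
struct counting_allocator {
    using value_type = T;
    counting_allocator() = default;
    template <typename U> counting_allocator(counting_allocator<U> const&) {}
    T* allocate(std::size_t n) { ++g_allocs; return static_cast<T*>(::operator new(n * sizeof(T))); }
    void deallocate(T* p, std::size_t) { ++g_frees; ::operator delete(p); }
};
template <typename T, typename U>
bool operator==(counting_allocator<T> const&, counting_allocator<U> const&) { return true; }
template <typename T, typename U>
bool operator!=(counting_allocator<T> const&, counting_allocator<U> const&) { return false; }

int main()
{
    for (int i = 0; i < 1000; ++i) {          // same shape as the naive benchmark load
        std::vector<int, counting_allocator<int>> vec;
        for (int j = 0; j < 10000; ++j)
            vec.push_back(i + j);
    }
    std::cout << g_allocs << " allocations, " << g_frees << " deallocations\n";
}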

So far so good! So what is taking all that time?

In there, unordered_malloc inlines to a large number of lines from various Boost Pool headers. The worst offenders were inlined from boost/pool/simple_segregated_storage.hpp (the second column is the cost as a percentage of the parent):

Those lines are in try_malloc_n:

template <typename SizeType>
void * simple_segregated_storage<SizeType>::try_malloc_n(
    void * & start, size_type n, const size_type partition_size)
{
  void * iter = nextof(start);
  while (--n != 0)
  {
    void * next = nextof(iter);
    if (next != static_cast<char *>(iter) + partition_size)
    {
      // next == 0 (end-of-list) or non-contiguous chunk found
      start = iter;
      return 0;
    }
    iter = next;
  }
  return iter;
}

which self-describes as:

The function attempts to find n contiguous chunks of size partition_size in the free list, starting at start. If it succeeds, it returns the last chunk in that contiguous sequence, so that the sequence is known by [start, {retval}]. If it fails, it does so either because it's at the end of the free list or hits a non-contiguous chunk. In either case, it will return 0, and set start to the last considered chunk. You are at the end of the free list if nextof(start) == 0. Otherwise, start points to the last chunk in the contiguous sequence, and nextof(start) points to the first chunk in the next contiguous sequence (assuming an ordered free list).

It does indeed appear that, for this scenario, chasing free chunks in the segregated storage is simply too costly after all. A napkin calculation suggests that try_malloc_n accounts for 99.75% of the cost of the higher-level unordered_malloc calls we saw earlier.
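A back-of-the-envelope sketch of my own illustrates the scale that free-list walk has to cope with. It assumes a capacity-doubling vector, and that pool_allocator services allocate(n) via the singleton pool's ordered_malloc(n), i.e. as a request for n contiguous, roughly int-sized chunks (which is my reading of the pool_allocator docs, not something measured here):

#include <cstddef>
#include <iostream>

int main()
{
    // Assume doubling growth: each reallocation asks the pool for the *new*
    // capacity as one contiguous run of chunks.
    std::size_t per_vector = 0;
    for (std::size_t cap = 1; ; cap *= 2) {
        per_vector += cap;          // chunks requested by this reallocation
        if (cap >= 10000) break;    // capacity now covers all 10,000 push_backs
    }
    std::cout << "chunks requested per vector: " << per_vector << "\n"
              << "over 1000 vectors:           " << per_vector * 1000 << "\n";
    // Roughly 33k chunks per vector, tens of millions over the naive run, and
    // each request makes try_malloc_n walk a free list that is already tens of
    // thousands of chunks long, looking for that many *contiguous* chunks.
}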

SHOCK: An alternative implementation?

In the course of my investigation I found some defines that can be used to gain more insight, e.g.:

#define NDEBUG
//#define BOOST_POOL_INSTRUMENT 1
//#define BOOST_POOL_VALIDATE 1
//#define BOOST_POOL_VALGRIND 1

Now, with VALIDATE/INSTRUMENT I observed the expected effects (very verbose output and a slight performance penalty).

From reading the name/code I expected BOOST_POOL_VALGRIND to degrade performance in a similar way (after all, when running under Valgrind it presumably has to do extra work to avoid false-positive memory errors, right?). To my surprise, defining it makes the whole thing run lightning fast. Live On Compiler Explorer

run 1 naive:      8.166ms reserved:   5.267ms(std::allocator<int>)
run 1 naive:      9.713ms reserved:   5.267ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
run 1 naive:      8.853ms reserved:   5.226ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
run 1 naive:      8.990ms reserved:   5.282ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
run 1 naive:      8.899ms reserved:   5.246ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
run 1 naive:      8.620ms reserved:   5.237ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
run 1 naive:      8.622ms reserved:   5.247ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
run 1 naive:      8.963ms reserved:   5.257ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
run 1 naive:      8.990ms reserved:   5.271ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
 -- Done (127.276ms)
run 2 naive:      7.965ms reserved:   5.208ms(std::allocator<int>)
run 2 naive:      8.503ms reserved:   5.236ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
run 2 naive:      8.809ms reserved:   5.254ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
run 2 naive:      8.954ms reserved:   5.278ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
run 2 naive:      8.878ms reserved:   5.279ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
run 2 naive:      8.694ms reserved:   5.243ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
run 2 naive:      8.661ms reserved:   5.249ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
run 2 naive:      8.920ms reserved:   5.248ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
run 2 naive:      8.952ms reserved:   5.261ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
 -- Done (125.680ms)
run 3 naive:      7.949ms reserved:   5.221ms(std::allocator<int>)
run 3 naive:      8.498ms reserved:   5.238ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
run 3 naive:      8.813ms reserved:   5.230ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
run 3 naive:      9.033ms reserved:   5.279ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
run 3 naive:      8.909ms reserved:   5.252ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
run 3 naive:      8.605ms reserved:   5.244ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
run 3 naive:      8.623ms reserved:   5.246ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
run 3 naive:      8.918ms reserved:   5.247ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
run 3 naive:      8.969ms reserved:   5.268ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)

Sadly, looking at the details confirms that it cheats by delegating directly to the standard library (while adding some overhead for the free_list/used_list address sets).

Summary

Yes, the standard pool/simple_segregated_storage implementation performs badly under this load. Whether that strictly counts as a bug I cannot say, but judging by the documentation (the very passage you quoted), it certainly looks that way.
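For completeness, and to address the "when would a pool help" part of the question: the pattern the documentation has in mind is many equally sized single-object allocations, e.g. the nodes of a node-based container. A minimal sketch of my own (not benchmarked here) would use boost::fast_pool_allocator, the variant the Boost docs suggest for one-chunk-at-a-time requests; whether it actually beats your system allocator is still something you have to measure:

#include <boost/pool/pool_alloc.hpp>
#include <list>

int main()
{
    // Every list node is one fixed-size allocation: the "lot of allocation
    // and deallocation of small objects" the documentation talks about.
    std::list<int, boost::fast_pool_allocator<int>> lst;
    for (int i = 0; i < 100000; ++i)
        lst.push_back(i);
}   // node memory returns to the singleton pool

The key difference from the vector benchmark is that every allocation has the same small size, so the pool never has to hunt for long contiguous runs of chunks.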