为什么在这样的递归代码中避免内存 allocation/deallocation 并不能节省运行时间？

Question

假设一个函数 foo_with_mem_allocs 通过对整数向量进行递归操作，因此它需要在每个递归步骤中 allocate/delete 一个缓冲区数组（类似于 merge_sort算法会做）：

void foo_with_mem_allocs(vector<int>& v, int c)
{
    if (c > 1)
    {
        c /= 2;
        foo_with_mem_allocs(v, c);

        int *buffer = new int[v.size()];
        for (int i = 0; i < v.size(); i++)
        {
            buffer[i] = v[i];
        }

        for (int j = 0; j < c; j++)
            v[j] = buffer[j] * c;

        delete[] buffer;
    }
}

现在假设想通过将缓冲区传递给函数来避免 allocating/deleting 数组缓冲区的开销。我的直接想法是实现上面相同功能的副本，稍作修改：

void foo_with_buffer(vector<int>& v, int c, vector<int>& buffer)
{

    if (c > 1)
    {
        c /= 2;
        foo_with_buffer(v, c, buffer);

        for (int i = 0; i < v.size(); i++)
        {
            buffer[i] = v[i];
        }

        for (int j = 0; j < c; j++)
            v[j] = buffer[j] * c;
    }
}

我希望 foo_with_buffer 比 foo_with_mem_allocs 快得多。然而，当我运行测试用例并对它们进行分析时，我得到的实际上是对于任何给定的 v 大小和任何给定值，它们都花费大致相同的时间来运行 c。很多时候，带有缓冲区的版本甚至运行s 更慢：

我真的很想理解为什么没有很多内存分配和释放的版本没有像我认为的那样运行得更快。如果有帮助，我测试了在 Windows 10 64 位下使用 Visual Studio 2015（发布模式，64 位）编译以及在 Unix 下使用 G++ 编译 g++ -o test test.cpp（其中 test.cpp 显然是我将整个代码放入的文件的名称）。上面的图片和结果示例来自 Unix 下的运行。

您可以在下面找到完整的代码，包括分析测试，采用可直接重现的格式：

//////////////////////////////////////////////////////////////////////////////

#include <iostream>
#include <ctime>
#include <cstdlib>
#include <vector>
#include <algorithm>
using namespace std;

void foo_with_mem_allocs(vector<int>& v, int c) {

    if (c > 1)
    {
        c /= 2;
        foo_with_mem_allocs(v, c);

        int *buffer = new int[v.size()];
        for (int i = 0; i < v.size(); i++)
        {
            buffer[i] = v[i];
        }

        for (int j = 0; j < c; j++)
            v[j] = buffer[j] * c;

        delete[] buffer;
    }
}
void foo_with_buffer(vector<int>& v, int c, vector<int>& buffer) {

    if (c > 1)
    {
        c /= 2;
        foo_with_buffer(v, c, buffer);

        for (int i = 0; i < v.size(); i++)
        {
            buffer[i] = v[i];
        }

        for (int j = 0; j < c; j++)
            v[j] = buffer[j] * c;
    }
}

int main(int argc, char** argv) {

    // setting a random seed using clock
    srand(time(NULL));
    // print out header of output information
    cout << "Size\tMethod\t\t\tTicks\tSecs" << endl;
    // iterate over possible input data sizes specified in the arguments.
    int sz = 100000;
    vector<int> v(sz);
    vector<int> v1(sz);
    vector<int> v2(sz);
    vector<int> buffer(sz);

    for (int i = 0; i < sz; ++i)
        v[i] = v1[i] = v2[i] = rand();

    // timing foo that does a bunch of memory allocations/deallocations
    clock_t t = clock();
    foo_with_mem_allocs(v1, sz);
    t = clock() - t;
    cout << sz << "\tWith mem alloc\t\t" << t << "\t"
        << (double)t / CLOCKS_PER_SEC << endl;
    // timing foo that uses buffers to avoid memory allocations/deallocations
    t = clock();
    foo_with_buffer(v2, sz, buffer);
    t = clock() - t;
    cout << sz << "\tWith buffer\t\t" << t <<
        "\t" << (double)t / CLOCKS_PER_SEC << endl;

    bool changed = false;
    for (int i = 0; i < v1.size(); i++)
        if (v1[i] != v2[i])
            changed = true;
    if (changed)
        std::cout << "Note: results from the two functions were different.\n" << std::endl;
    else
        std::cout << "Note: results from the two functions were identical.\n" << std::endl;
    return 0;
}
///////////////////////////////////////////////////////////////////////////////

Answer 1

您假设动态分配一定很昂贵。你需要重新审视那个预判。

像这样的微基准测试不会揭示内存分配的成本，因为用例非常理想。您总是分配一个相同大小的缓冲区，并在发出另一个分配请求之前立即释放它。任何体面的内存分配库每次都会给你相同的内存，从最近释放的所需大小的分配缓存中获取。代码路径很短。

但即使在现实的基准测试中，您也会发现几代非常有才华的程序员为了您的利益而非常努力地优化内存分配的结果，这样您就可以使用动态分配，而不必将时间浪费在过早的优化上.

作为小后记，未启用优化的微基准测试可能会产生其他不协调之处。例如，std::vector<int> 中未优化的元素访问可能比 int 的简单数组中的元素访问慢得多。这种差异可能足以完全掩盖一些动态分配的轻微成本。

Answer 2

你说的是 "I tried this and I got a confusing time result"。

这意味着您实际上并不知道是什么占了时间。如果您不知道是什么占了时间，那么您所做的就是猜测。 即使是一个好的猜测仍然是猜测，而猜测，就其本质而言，可以欺骗。

This technique 不花钱，超级简单，而且会准确告诉您什么需要时间。没有猜测。

为什么在这样的递归代码中避免内存 allocation/deallocation 并不能节省运行时间？

Why avoiding memory allocation/deallocation in a recursive code like this is not saving run time?

c++

performance

profiling

allocation

为什么在这样的递归代码中避免内存 allocation/deallocation 并不能节省 运行 时间？

Why avoiding memory allocation/deallocation in a recursive code like this is not saving run time?

c++

performance

profiling

allocation

为什么在这样的递归代码中避免内存 allocation/deallocation 并不能节省运行时间？