为什么同样的for循环第二次跑得更快?

Why does the same for loop runs faster in the second time?

最初我是在比较内置 D 数组和普通指针的性能,但我遇到了一个不同的问题。出于某种原因,如果我 运行 两个相同的 for 循环一个接一个,第二个总是完成得更快。

代码如下:

import std.stdio : writeln;
import std.datetime : StopWatch;
import core.stdc.stdlib : malloc, free;

void main()
{
    immutable N = 1_000_000_000;
    StopWatch sw;

    uint* ptr = cast(uint*)malloc(uint.sizeof * N);

    sw.start();
    for (uint i = 0; i < N; ++i)
        ptr[i] = 1;
    sw.stop();
    writeln("the first for loop time: ", sw.peek().msecs(), " msecs");
    sw.reset();

    sw.start();
    for (uint i = 0; i < N; ++i)
        ptr[i] = 2;
    sw.stop();
    writeln("the second for loop time: ", sw.peek().msecs(), " msecs");
    sw.reset();

    free(ptr);
}

编译并运行与dmd -release -O -noboundscheck -inline test.d -of=test && ./test结合后,它打印:

the first for loop time: 1253 msecs
the second for loop time: 357 msecs

我不确定这是否与 D 或 dmd 有关,所以我用 C++ 重写了这段代码:

#include <iostream>
#include <chrono>

int main()
{
    const unsigned int N = 1000000000;

    unsigned int* ptr = (unsigned int*)malloc(sizeof(unsigned int) * N);

    auto start = std::chrono::high_resolution_clock::now();
    for (uint i = 0; i < N; ++i)
        ptr[i] = 1;
    auto finish = std::chrono::high_resolution_clock::now();
    auto milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
    std::cout << "the first for loop time: " << milliseconds.count() << " msecs" << std::endl;

    start = std::chrono::high_resolution_clock::now();
    for (uint i = 0; i < N; ++i)
        ptr[i] = 2;
    finish = std::chrono::high_resolution_clock::now();
    milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
    std::cout << "the second for loop time: " << milliseconds.count() << " msecs" << std::endl;

    free(ptr);
}

g++ -O3 test.cpp -o test && ./test 给出类似的输出:

the first for loop time: 1029 msecs
the second for loop time: 349 msecs

每次我运行这段代码的结果都是一样的。分配的数据太大而无法缓存。没有分支点,因此不应涉及分支预测问题。在两个循环中以相同的直接顺序访问内存,所以我想这应该与内存布局无关。

那么为什么第二个 运行 比第一个快?

因为 uint* ptr = cast(uint*)malloc(uint.sizeof * N); 在对许多元素进行循环之前不会分配内存。你可以测试一下:

import core.stdc.stdlib : malloc, free;

void main()
{
    immutable N = 1_000_000_000;
    uint* ptr = cast(uint*)malloc(uint.sizeof * N);

    foreach (_; 0 .. 100)
    for (uint i = 0; i < N; ++i)
        ptr[N-1] = 1;

    // until this point almost no memory is allocated
    for (uint i = 0; i < N; ++i)
        ptr[i] = 2;

    free(ptr);
}

更新 @Eljay 已经在评论中解释了这一点