为什么同样的for循环第二次跑得更快?
Why does the same for loop runs faster in the second time?
最初我是在比较内置 D 数组和普通指针的性能,但我遇到了一个不同的问题。出于某种原因,如果我 运行 两个相同的 for 循环一个接一个,第二个总是完成得更快。
代码如下:
import std.stdio : writeln;
import std.datetime : StopWatch;
import core.stdc.stdlib : malloc, free;
void main()
{
immutable N = 1_000_000_000;
StopWatch sw;
uint* ptr = cast(uint*)malloc(uint.sizeof * N);
sw.start();
for (uint i = 0; i < N; ++i)
ptr[i] = 1;
sw.stop();
writeln("the first for loop time: ", sw.peek().msecs(), " msecs");
sw.reset();
sw.start();
for (uint i = 0; i < N; ++i)
ptr[i] = 2;
sw.stop();
writeln("the second for loop time: ", sw.peek().msecs(), " msecs");
sw.reset();
free(ptr);
}
编译并运行与dmd -release -O -noboundscheck -inline test.d -of=test && ./test
结合后,它打印:
the first for loop time: 1253 msecs
the second for loop time: 357 msecs
我不确定这是否与 D 或 dmd 有关,所以我用 C++ 重写了这段代码:
#include <iostream>
#include <chrono>
int main()
{
const unsigned int N = 1000000000;
unsigned int* ptr = (unsigned int*)malloc(sizeof(unsigned int) * N);
auto start = std::chrono::high_resolution_clock::now();
for (uint i = 0; i < N; ++i)
ptr[i] = 1;
auto finish = std::chrono::high_resolution_clock::now();
auto milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
std::cout << "the first for loop time: " << milliseconds.count() << " msecs" << std::endl;
start = std::chrono::high_resolution_clock::now();
for (uint i = 0; i < N; ++i)
ptr[i] = 2;
finish = std::chrono::high_resolution_clock::now();
milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
std::cout << "the second for loop time: " << milliseconds.count() << " msecs" << std::endl;
free(ptr);
}
和 g++ -O3 test.cpp -o test && ./test
给出类似的输出:
the first for loop time: 1029 msecs
the second for loop time: 349 msecs
每次我运行这段代码的结果都是一样的。分配的数据太大而无法缓存。没有分支点,因此不应涉及分支预测问题。在两个循环中以相同的直接顺序访问内存,所以我想这应该与内存布局无关。
那么为什么第二个 运行 比第一个快?
因为 uint* ptr = cast(uint*)malloc(uint.sizeof * N);
在对许多元素进行循环之前不会分配内存。你可以测试一下:
import core.stdc.stdlib : malloc, free;
void main()
{
immutable N = 1_000_000_000;
uint* ptr = cast(uint*)malloc(uint.sizeof * N);
foreach (_; 0 .. 100)
for (uint i = 0; i < N; ++i)
ptr[N-1] = 1;
// until this point almost no memory is allocated
for (uint i = 0; i < N; ++i)
ptr[i] = 2;
free(ptr);
}
更新
@Eljay 已经在评论中解释了这一点
最初我是在比较内置 D 数组和普通指针的性能,但我遇到了一个不同的问题。出于某种原因,如果我 运行 两个相同的 for 循环一个接一个,第二个总是完成得更快。
代码如下:
import std.stdio : writeln;
import std.datetime : StopWatch;
import core.stdc.stdlib : malloc, free;
void main()
{
immutable N = 1_000_000_000;
StopWatch sw;
uint* ptr = cast(uint*)malloc(uint.sizeof * N);
sw.start();
for (uint i = 0; i < N; ++i)
ptr[i] = 1;
sw.stop();
writeln("the first for loop time: ", sw.peek().msecs(), " msecs");
sw.reset();
sw.start();
for (uint i = 0; i < N; ++i)
ptr[i] = 2;
sw.stop();
writeln("the second for loop time: ", sw.peek().msecs(), " msecs");
sw.reset();
free(ptr);
}
编译并运行与dmd -release -O -noboundscheck -inline test.d -of=test && ./test
结合后,它打印:
the first for loop time: 1253 msecs
the second for loop time: 357 msecs
我不确定这是否与 D 或 dmd 有关,所以我用 C++ 重写了这段代码:
#include <iostream>
#include <chrono>
int main()
{
const unsigned int N = 1000000000;
unsigned int* ptr = (unsigned int*)malloc(sizeof(unsigned int) * N);
auto start = std::chrono::high_resolution_clock::now();
for (uint i = 0; i < N; ++i)
ptr[i] = 1;
auto finish = std::chrono::high_resolution_clock::now();
auto milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
std::cout << "the first for loop time: " << milliseconds.count() << " msecs" << std::endl;
start = std::chrono::high_resolution_clock::now();
for (uint i = 0; i < N; ++i)
ptr[i] = 2;
finish = std::chrono::high_resolution_clock::now();
milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
std::cout << "the second for loop time: " << milliseconds.count() << " msecs" << std::endl;
free(ptr);
}
和 g++ -O3 test.cpp -o test && ./test
给出类似的输出:
the first for loop time: 1029 msecs
the second for loop time: 349 msecs
每次我运行这段代码的结果都是一样的。分配的数据太大而无法缓存。没有分支点,因此不应涉及分支预测问题。在两个循环中以相同的直接顺序访问内存,所以我想这应该与内存布局无关。
那么为什么第二个 运行 比第一个快?
因为 uint* ptr = cast(uint*)malloc(uint.sizeof * N);
在对许多元素进行循环之前不会分配内存。你可以测试一下:
import core.stdc.stdlib : malloc, free;
void main()
{
immutable N = 1_000_000_000;
uint* ptr = cast(uint*)malloc(uint.sizeof * N);
foreach (_; 0 .. 100)
for (uint i = 0; i < N; ++i)
ptr[N-1] = 1;
// until this point almost no memory is allocated
for (uint i = 0; i < N; ++i)
ptr[i] = 2;
free(ptr);
}
更新 @Eljay 已经在评论中解释了这一点