clock_gettime takes longer to execute when the program is run from a terminal
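I am trying to measure the time taken by a piece of code, and noticed that the timings differ when I run the program from inside my editor, QtCreator, compared to when I run it from a bash shell started in a gnome-terminal. I'm using Ubuntu 20.04 as the OS.
A small program to reproduce my problem: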
#include <stdio.h>
#include <time.h>

/* Read the monotonic clock. */
struct timespec now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts;
}

/* Nanoseconds elapsed from tick to tock. */
long interval_ns(struct timespec tick, struct timespec tock) {
    return (tock.tv_sec - tick.tv_sec) * 1000000000L
        + (tock.tv_nsec - tick.tv_nsec);
}

int main(void) {
    // sleep(1);  /* needs #include <unistd.h> when enabled */
    for (size_t i = 0; i < 10; i++) {
        struct timespec tick = now();
        struct timespec tock = now();
        long elapsed = interval_ns(tick, tock);
        printf("It took %ld ns\n", elapsed);  /* %ld: elapsed is a signed long */
    }
    return 0;
}
Output when run from within QtCreator:
It took 84 ns
It took 20 ns
It took 20 ns
It took 21 ns
It took 21 ns
It took 21 ns
It took 22 ns
It took 21 ns
It took 20 ns
It took 21 ns
And when run from my shell inside the terminal:
$ ./foo
It took 407 ns
It took 136 ns
It took 74 ns
It took 73 ns
It took 77 ns
It took 79 ns
It took 74 ns
It took 81 ns
It took 74 ns
It took 78 ns
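Things I have tried that had no effect
- Letting QtCreator start the program in a terminal
- Using rdtsc and rdtscp calls instead of clock_gettime (the relative differences in run time are the same)
- Clearing the environment from the terminal by running the program under env -i
- Starting the program using sh instead of bash
I have verified that the same binary is called in all cases.
I have verified that the nice value of the program is 0 in all cases.
Question
Why does starting the program from my shell make a difference? Any suggestions on what to try?
Update
If I add a sleep(1) call at the beginning of main, both the QtCreator and the gnome-terminal/bash invocations report the longer execution times.
If I add a system("ps -H") call at the beginning of main, but remove the aforementioned sleep(1), both invocations report the short execution times (~20 ns).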
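Just add more iterations to give the CPU time to ramp up to max clock speed. Your "slow" times are with the CPU at its low-power idle clock speed.
QtCreator apparently uses enough CPU time to make this happen before your program runs, or else you're compiling + running and the compilation process serves as a warm-up. (By comparison, bash's fork/execve is much lighter weight.)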
See for more about doing warm-up runs when benchmarking, and also
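On my i7-6700k (Skylake) running Linux, increasing the loop iteration count to 1000 is enough to get the final iterations running at full clock speed, even after the first few iterations have handled page faults and warmed up the iTLB, uop cache, data caches, and so on.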
$ ./a.out
It took 244 ns
It took 150 ns
It took 73 ns
It took 76 ns
It took 75 ns
It took 71 ns
It took 72 ns
It took 72 ns
It took 69 ns
It took 75 ns
...
It took 74 ns
It took 68 ns
It took 69 ns
It took 72 ns
It took 72 ns # 382 "slow" iterations in this test run (copy/paste into wc to check)
It took 15 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 16 ns
It took 16 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 14 ns
It took 16 ns
...
On my system, energy_performance_preference is set to balance_performance, so the hardware P-state governor isn't as aggressive as it would be with performance. Use grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference to check the setting, and sudo to change it:
sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;do echo balance_performance > "$i";done'
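Even running it under perf stat ./a.out is enough to ramp up to max clock speed very quickly, though; it really doesn't take much. But bash's command parsing after you press return is very cheap: not much CPU work is done before it calls execve and execution reaches main in your new process.
The printf with line-buffered output is what takes most of the CPU time in your program, by the way. That's why so few iterations are needed to ramp up to speed. For example, if you run perf stat --all-user -r10 ./a.out, you'll see that user-space core clock cycles per second are only about 0.4GHz, with the rest of the time spent in the kernel in write system calls.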