clock_gettime takes longer to execute when program run from terminal

Question

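I was trying to measure the time taken by a snippet of code and noticed that the timings were shorter when I ran the program from inside my editor, QtCreator, than when I ran it from a bash shell started in a gnome-terminal. I'm using Ubuntu 20.04 as the OS.

A small program to reproduce my problem: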

#include <stdio.h>
#include <time.h>

struct timespec now() {
  struct timespec now;
  clock_gettime(CLOCK_MONOTONIC, &now);
  return now;
}

long interval_ns(struct timespec tick, struct timespec tock) {
  return (tock.tv_sec - tick.tv_sec) * 1000000000L
      + (tock.tv_nsec - tick.tv_nsec);
}

int main() {
    // sleep(1);
    for (size_t i = 0; i < 10; i++) {
        struct timespec tick = now();
        struct timespec tock = now();
        long elapsed = interval_ns(tick, tock);
        printf("It took %ld ns\n", elapsed);  // %ld matches the signed long argument
    }
    return 0;
}

Output when run from within QtCreator:

It took 84 ns
It took 20 ns
It took 20 ns
It took 21 ns
It took 21 ns
It took 21 ns
It took 22 ns
It took 21 ns
It took 20 ns
It took 21 ns

Whereas when run from within my shell in a terminal:

$ ./foo 
It took 407 ns
It took 136 ns
It took 74 ns
It took 73 ns
It took 77 ns
It took 79 ns
It took 74 ns
It took 81 ns
It took 74 ns
It took 78 ns

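Things I've tried that had no effect

Letting QtCreator start the program in a terminal
Using rdtsc and rdtscp calls instead of clock_gettime (same relative differences in run time)
Running it through env -i (an empty environment)
Using sh instead of bash

I have verified that the same binary is called in all cases. I have verified that the nice value of the program is 0 in all cases.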

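Question

Why does starting the program from my shell make a difference? Any suggestions on what to try?

Update

If I add a sleep(1) call at the beginning of main, both the QtCreator and the gnome-terminal/bash invocations report the longer execution times.
If I add a system("ps -H") call at the beginning of main, but drop the aforementioned sleep(1), both invocations report the short execution times (~20 ns).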

Answer 1

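Just add more iterations to give the CPU time to ramp up to its max clock speed. Your "slow" times are with the CPU at its low-power idle clock speed.

QtCreator apparently uses enough CPU time before your program runs to make this happen, or else you're compiling + running and the compilation process serves as a warm-up. (By comparison, bash's fork/execve is much lighter weight.)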

See … for more about doing warm-up runs when benchmarking, and also …

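On my i7-6700k (Skylake) running Linux, increasing the loop iteration count to 1000 is enough to get the final iterations running at full clock speed, even after the first few iterations spent handling page faults and warming up the iTLB, uop cache, data caches, and so on.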

$ ./a.out      
It took 244 ns
It took 150 ns
It took 73 ns
It took 76 ns
It took 75 ns
It took 71 ns
It took 72 ns
It took 72 ns
It took 69 ns
It took 75 ns
...
It took 74 ns
It took 68 ns
It took 69 ns
It took 72 ns
It took 72 ns        # 382 "slow" iterations in this test run (copy/paste into wc to check)
It took 15 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 16 ns
It took 16 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 14 ns
It took 16 ns
...
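The warm-up idea can be sketched in C as follows. This is only an illustration, not code from the answer: the helper names spin_warmup and min_back_to_back_ns are hypothetical, and the warm-up duration is an assumption that would need tuning for a given machine and power setting.

```c
/* Ask for POSIX declarations (clock_gettime) when building with -std=c99/c11. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed from tick to tock. */
int64_t interval_ns(struct timespec tick, struct timespec tock) {
    return (int64_t)(tock.tv_sec - tick.tv_sec) * 1000000000LL
         + (tock.tv_nsec - tick.tv_nsec);
}

/* Hypothetical helper: busy-wait for about warmup_ms milliseconds so the
 * CPU has a reason to leave its idle clock speed before we measure. */
void spin_warmup(long warmup_ms) {
    struct timespec start, cur;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &cur);
    } while (interval_ns(start, cur) < (int64_t)warmup_ms * 1000000LL);
}

/* Hypothetical helper: time two back-to-back clock_gettime calls `iters`
 * times and return the smallest interval seen, in nanoseconds. */
int64_t min_back_to_back_ns(int iters) {
    int64_t best = INT64_MAX;
    for (int i = 0; i < iters; i++) {
        struct timespec tick, tock;
        clock_gettime(CLOCK_MONOTONIC, &tick);
        clock_gettime(CLOCK_MONOTONIC, &tock);
        int64_t d = interval_ns(tick, tock);
        if (d < best)
            best = d;
    }
    return best;
}
```

Calling spin_warmup(100) at the top of main and then printing min_back_to_back_ns(1000) should make the result far less sensitive to how the program was launched, assuming the warm-up is long enough for the frequency to ramp on the machine in question.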

On my system, energy_performance_preference is set to balance_performance, so the hardware P-state governor isn't as aggressive as it would be with performance. Use grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference to check it, and use sudo to change it:

sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;do echo balance_performance > "$i";done'

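Even running it under perf stat ./a.out is enough to ramp up to max clock speed very quickly, though; it really doesn't take much. But bash's command parsing after you press return is very cheap: not much CPU work is done before it calls execve and execution reaches main in your new process.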
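One way to watch the ramp-up from inside the program is to read the cpufreq sysfs files before and after the timing loop. The sketch below is an assumption-laden illustration, not part of the answer: it reads scaling_cur_freq for cpu0 only (the program may be running on a different core), the helper names read_khz and print_cur_freq are hypothetical, and the file may be missing in a VM or container, in which case the helper reports that instead of failing.

```c
#include <stdio.h>

/* Hypothetical helper: read an integer kHz value from a cpufreq sysfs file.
 * Returns -1 if the file cannot be opened or parsed. */
long read_khz(const char *path) {
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    long khz = -1;
    if (fscanf(f, "%ld", &khz) != 1)
        khz = -1;
    fclose(f);
    return khz;
}

/* Hypothetical helper: print the frequency the kernel currently reports
 * for cpu0. Assumption: cpu0 is representative of the core we run on. */
void print_cur_freq(const char *label) {
    long khz = read_khz(
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
    if (khz < 0)
        printf("%s: frequency unavailable\n", label);
    else
        printf("%s: %ld kHz\n", label, khz);
}
```

Calling print_cur_freq("before") at the start of main and print_cur_freq("after") once the loop has finished gives a rough indication of whether the clock speed changed during the run. The value the kernel reports is not guaranteed to be an instantaneous reading.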

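The printf with line-buffered output is what takes most of the CPU time in your program, by the way. That's why it takes so few iterations to ramp up to speed. For example, if you run perf stat --all-user -r10 ./a.out, you'll see that the user-space core clock cycles per second are only about 0.4 GHz, with the rest of the time spent in the kernel in write system calls.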
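If the write system calls are not something you want interleaved with the measurements, one option is to record the intervals in an array and print them afterwards. This is a sketch under that assumption, with hypothetical helper names collect_samples and print_samples; it is not a change the answer itself proposes.

```c
/* Ask for POSIX declarations (clock_gettime) when building with -std=c99/c11. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: fill samples[0..n-1] with the interval, in ns,
 * between two back-to-back clock_gettime calls. No I/O happens here. */
void collect_samples(long *samples, size_t n) {
    for (size_t i = 0; i < n; i++) {
        struct timespec tick, tock;
        clock_gettime(CLOCK_MONOTONIC, &tick);
        clock_gettime(CLOCK_MONOTONIC, &tock);
        samples[i] = (tock.tv_sec - tick.tv_sec) * 1000000000L
                   + (tock.tv_nsec - tick.tv_nsec);
    }
}

/* Hypothetical helper: print the recorded intervals after the timed
 * loop, using the same message format as the question. */
void print_samples(const long *samples, size_t n) {
    for (size_t i = 0; i < n; i++)
        printf("It took %ld ns\n", samples[i]);
}
```

Note the trade-off: with printf out of the loop, each iteration does far less work, so many more iterations (or an explicit warm-up) will likely be needed before the CPU reaches full clock speed.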


Tags: c, linux, x86, cpu-cycles, microbenchmark
