Inconsistency when profiling my code with gprof

I am using a relatively simple code parallelized with OpenMP to get familiar with gprof.

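My code mainly consists of gathering data from input files, performing some array manipulations, and writing the new data to different output files. I placed calls to the intrinsic subroutine CPU_TIME around each step to see whether gprof is accurate: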

PROGRAM main
    USE global_variables
    USE fileio, ONLY: read_old_restart, write_new_restart, output_slice, write_solution
    USE change_vars
    IMPLICIT NONE
    REAL(dp) :: t0, t1

    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    CALL CPU_TIME(t0)
    CALL allocate_data
    CALL CPU_TIME(t1)
    PRINT*, "Allocate data =", t1 - t0

    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    CALL CPU_TIME(t0)
    CALL build_grid
    CALL CPU_TIME(t1)
    PRINT*, "Build grid    =", t1 - t0

    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    CALL CPU_TIME(t0)
    CALL read_old_restart
    CALL CPU_TIME(t1)
    PRINT*, "Read restart  =", t1 - t0


    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    CALL CPU_TIME(t0)
    CALL regroup_all
    CALL CPU_TIME(t1)
    PRINT*, "Regroup all   =", t1 - t0

    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    CALL CPU_TIME(t0)
    CALL redistribute_all
    CALL CPU_TIME(t1)
    PRINT*, "Redistribute  =", t1 - t0

    !~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    CALL CPU_TIME(t0)
    CALL write_new_restart
    CALL CPU_TIME(t1)
    PRINT*, "Write restart =", t1 - t0
END PROGRAM main

Here is the output:

 Allocate data =  1.000000000000000E-003
 Build grid    =  0.000000000000000E+000
 Read restart  =   10.7963590000000
 Regroup all   =   6.65998700000000
 Redistribute  =   14.3518180000000
 Write restart =   53.5218640000000

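So the write_new_restart subroutine is the most time-consuming, taking about 62% of the total run time. However, according to gprof, the subroutine redistribute_vars, which is called multiple times by redistribute_all, is the most time-consuming, with about 74% of the total time. Here is the gprof output: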

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 74.40      8.95     8.95       61     0.15     0.15  change_vars_mp_redistribute_vars_
 19.12     11.25     2.30       60     0.04     0.04  change_vars_mp_regroup_vars_
  6.23     12.00     0.75       63     0.01     0.01  change_vars_mp_fill_last_blocks_
  0.08     12.01     0.01        1     0.01     2.31  change_vars_mp_regroup_all_
  0.08     12.02     0.01                             __intel_ssse3_rep_memcpy
  0.08     12.03     0.01                             for_open
  0.00     12.03     0.00        1     0.00    12.01  MAIN__
  0.00     12.03     0.00        1     0.00     0.00  change_vars_mp_build_grid_
  0.00     12.03     0.00        1     0.00     9.70  change_vars_mp_redistribute_all_
  0.00     12.03     0.00        1     0.00     0.00  fileio_mp_read_old_restart_
  0.00     12.03     0.00        1     0.00     0.00  fileio_mp_write_new_restart_
  0.00     12.03     0.00        1     0.00     0.00  global_variables_mp_allocate_data_


index % time    self  children    called     name
                0.00   12.01       1/1           main [2]
[1]     99.8    0.00   12.01       1         MAIN__ [1]
                0.00    9.70       1/1           change_vars_mp_redistribute_all_ [3]
                0.01    2.30       1/1           change_vars_mp_regroup_all_ [5]
                0.00    0.00       1/1           global_variables_mp_allocate_data_ [13]
                0.00    0.00       1/1           change_vars_mp_build_grid_ [10]
                0.00    0.00       1/1           fileio_mp_read_old_restart_ [11]
                0.00    0.00       1/1           fileio_mp_write_new_restart_ [12]
-----------------------------------------------
                                                 <spontaneous>
[2]     99.8    0.00   12.01                 main [2]
                0.00   12.01       1/1           MAIN__ [1]
-----------------------------------------------
                0.00    9.70       1/1           MAIN__ [1]
[3]     80.6    0.00    9.70       1         change_vars_mp_redistribute_all_ [3]
                8.95    0.00      61/61          change_vars_mp_redistribute_vars_ [4]
                0.75    0.00      63/63          change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
                8.95    0.00      61/61          change_vars_mp_redistribute_all_ [3]
[4]     74.4    8.95    0.00      61         change_vars_mp_redistribute_vars_ [4]
-----------------------------------------------
                0.01    2.30       1/1           MAIN__ [1]
[5]     19.2    0.01    2.30       1         change_vars_mp_regroup_all_ [5]
                2.30    0.00      60/60          change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
                2.30    0.00      60/60          change_vars_mp_regroup_all_ [5]
[6]     19.1    2.30    0.00      60         change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
                0.75    0.00      63/63          change_vars_mp_redistribute_all_ [3]
[7]      6.2    0.75    0.00      63         change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
                                                 <spontaneous>
[8]      0.1    0.01    0.00                 for_open [8]
-----------------------------------------------
                                                 <spontaneous>
[9]      0.1    0.01    0.00                 __intel_ssse3_rep_memcpy [9]
-----------------------------------------------
                0.00    0.00       1/1           MAIN__ [1]
[10]     0.0    0.00    0.00       1         change_vars_mp_build_grid_ [10]
-----------------------------------------------
                0.00    0.00       1/1           MAIN__ [1]
[11]     0.0    0.00    0.00       1         fileio_mp_read_old_restart_ [11]
-----------------------------------------------
                0.00    0.00       1/1           MAIN__ [1]
[12]     0.0    0.00    0.00       1         fileio_mp_write_new_restart_ [12]
-----------------------------------------------
                0.00    0.00       1/1           MAIN__ [1]
[13]     0.0    0.00    0.00       1         global_variables_mp_allocate_data_ [13]
-----------------------------------------------

Note that regroup_all calls regroup_vars multiple times, and redistribute_all calls redistribute_vars and fill_last_blocks multiple times.

I am compiling my code with ifort, using the -openmp -O2 -pg options.

Question:

Why does gprof not see the time taken by my file I/O subroutines (read_old_restart, write_new_restart)?

gprof specifically does not include I/O time. It only tries to measure CPU time.

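That's because it only does two things: 1) it samples the program counter on a 1/100-second clock, and the program counter is meaningless during I/O, and 2) it counts the number of times any function B is called by any function A.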

From the call counts, it tries to guess how much of each function's CPU time can be attributed to each caller. That is its whole advance over pre-existing profilers.
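To make that attribution concrete, here is how I read the call graph posted in the question. As far as I understand gprof's propagation rule, the time a callee B hands up to a caller A is B's self time plus B's children time, scaled by the fraction of B's calls that came from A. The numbers below are taken from the question's output; the notation is mine.

```latex
\[
  T(A \leftarrow B) = \bigl(\text{self}_B + \text{children}_B\bigr)
                      \cdot \frac{\text{calls}(A \to B)}{\text{calls}(B)}
\]
\begin{align*}
\text{children}(\texttt{redistribute\_all})
  &= 8.95 \cdot \tfrac{61}{61} + 0.75 \cdot \tfrac{63}{63} = 9.70\ \text{s} \\
\text{children}(\texttt{MAIN\_\_})
  &= 9.70 \cdot \tfrac{1}{1} + (0.01 + 2.30) \cdot \tfrac{1}{1} = 12.01\ \text{s}
\end{align*}
```

Every routine here has a single caller, so each ratio is 1 and the split is trivial. If a routine had two callers making 3 and 1 of its 4 calls (a hypothetical case, not in this profile), gprof would split its time 3/4 and 1/4, whatever each call actually cost. Note also that the total is only 12.01 s of sampled CPU time, against roughly 85 s summed over the CPU_TIME printouts.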

When you use gprof, you should understand what it does and what it doesn't do.
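Since gprof leaves I/O out, one way to see where the run time actually goes is to time each section with a wall clock next to CPU_TIME. Below is a minimal sketch, not the asker's code: the program name, the array size and the scratch-file write standing in for write_new_restart are all assumptions made for illustration. It uses only the standard intrinsics CPU_TIME and SYSTEM_CLOCK.

```fortran
PROGRAM wall_vs_cpu
    ! Hypothetical demo: time one I/O-heavy section with both CPU_TIME
    ! (processor time) and SYSTEM_CLOCK (elapsed wall-clock time).
    IMPLICIT NONE
    INTEGER, PARAMETER :: dp = KIND(1.0d0)
    INTEGER, PARAMETER :: i8 = SELECTED_INT_KIND(18)
    INTEGER, PARAMETER :: n = 2000000        ! assumed payload size
    REAL(dp), ALLOCATABLE :: buffer(:)
    REAL(dp) :: cpu0, cpu1, wall_seconds
    INTEGER(i8) :: tick0, tick1, tick_rate
    INTEGER :: unit_no

    ALLOCATE(buffer(n))
    buffer = 1.0_dp

    ! Start both clocks.
    CALL SYSTEM_CLOCK(tick0, tick_rate)
    CALL CPU_TIME(cpu0)

    ! Stand-in for write_new_restart: dump the array to a scratch file.
    OPEN(NEWUNIT=unit_no, STATUS="SCRATCH", FORM="UNFORMATTED", &
         ACTION="READWRITE")
    WRITE(unit_no) buffer
    CLOSE(unit_no)

    ! Stop both clocks.
    CALL CPU_TIME(cpu1)
    CALL SYSTEM_CLOCK(tick1)

    ! Convert clock ticks to seconds using the reported tick rate.
    wall_seconds = REAL(tick1 - tick0, dp) / REAL(tick_rate, dp)

    PRINT *, "CPU time  (s) =", cpu1 - cpu0
    PRINT *, "Wall time (s) =", wall_seconds
END PROGRAM wall_vs_cpu
```

The two numbers can differ in either direction. Time spent blocked on the disk shows up in the wall clock but not in the CPU time. In an OpenMP build, as far as I know, CPU_TIME typically reports processor time summed over all threads, so it can exceed the wall time. Comparing both against the gprof flat profile should show which sections gprof is not sampling.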