使用 gprof 分析我的代码时不一致
Inconsistency when profiling my code with gprof
我正在使用与 OpenMP 并行化的相对简单的代码来熟悉 gprof。
我的代码主要包括从输入文件收集数据、执行一些数组操作以及将新数据写入不同的输出文件。我调用了内部子例程 CPU_TIME
以查看 gprof 是否准确:
PROGRAM main
USE global_variables
USE fileio, ONLY: read_old_restart, write_new_restart, output_slice, write_solution
USE change_vars
IMPLICIT NONE
REAL(dp) :: t0, t1
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL allocate_data
CALL CPU_TIME(t1)
PRINT*, "Allocate data =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL build_grid
CALL CPU_TIME(t1)
PRINT*, "Build grid =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL read_old_restart
CALL CPU_TIME(t1)
PRINT*, "Read restart =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL regroup_all
CALL CPU_TIME(t1)
PRINT*, "Regroup all =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL redistribute_all
CALL CPU_TIME(t1)
PRINT*, "Redistribute =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL write_new_restart
CALL CPU_TIME(t1)
PRINT*, "Write restart =", t1 - t0
END PROGRAM main
这是输出:
Allocate data = 1.000000000000000E-003
Build grid = 0.000000000000000E+000
Read restart = 10.7963590000000
Regroup all = 6.65998700000000
Redistribute = 14.3518180000000
Write restart = 53.5218640000000
因此,write_new_restart
子程序是最耗时的,大约占总运行时间的62%。然而根据grof,redistribute_all
多次调用的子例程redistribute_vars
是最耗时的,占总时间的70%。这是 gprof 的输出:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
74.40 8.95 8.95 61 0.15 0.15 change_vars_mp_redistribute_vars_
19.12 11.25 2.30 60 0.04 0.04 change_vars_mp_regroup_vars_
6.23 12.00 0.75 63 0.01 0.01 change_vars_mp_fill_last_blocks_
0.08 12.01 0.01 1 0.01 2.31 change_vars_mp_regroup_all_
0.08 12.02 0.01 __intel_ssse3_rep_memcpy
0.08 12.03 0.01 for_open
0.00 12.03 0.00 1 0.00 12.01 MAIN__
0.00 12.03 0.00 1 0.00 0.00 change_vars_mp_build_grid_
0.00 12.03 0.00 1 0.00 9.70 change_vars_mp_redistribute_all_
0.00 12.03 0.00 1 0.00 0.00 fileio_mp_read_old_restart_
0.00 12.03 0.00 1 0.00 0.00 fileio_mp_write_new_restart_
0.00 12.03 0.00 1 0.00 0.00 global_variables_mp_allocate_data_
index % time self children called name
0.00 12.01 1/1 main [2]
[1] 99.8 0.00 12.01 1 MAIN__ [1]
0.00 9.70 1/1 change_vars_mp_redistribute_all_ [3]
0.01 2.30 1/1 change_vars_mp_regroup_all_ [5]
0.00 0.00 1/1 global_variables_mp_allocate_data_ [13]
0.00 0.00 1/1 change_vars_mp_build_grid_ [10]
0.00 0.00 1/1 fileio_mp_read_old_restart_ [11]
0.00 0.00 1/1 fileio_mp_write_new_restart_ [12]
-----------------------------------------------
<spontaneous>
[2] 99.8 0.00 12.01 main [2]
0.00 12.01 1/1 MAIN__ [1]
-----------------------------------------------
0.00 9.70 1/1 MAIN__ [1]
[3] 80.6 0.00 9.70 1 change_vars_mp_redistribute_all_ [3]
8.95 0.00 61/61 change_vars_mp_redistribute_vars_ [4]
0.75 0.00 63/63 change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
8.95 0.00 61/61 change_vars_mp_redistribute_all_ [3]
[4] 74.4 8.95 0.00 61 change_vars_mp_redistribute_vars_ [4]
-----------------------------------------------
0.01 2.30 1/1 MAIN__ [1]
[5] 19.2 0.01 2.30 1 change_vars_mp_regroup_all_ [5]
2.30 0.00 60/60 change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
2.30 0.00 60/60 change_vars_mp_regroup_all_ [5]
[6] 19.1 2.30 0.00 60 change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
0.75 0.00 63/63 change_vars_mp_redistribute_all_ [3]
[7] 6.2 0.75 0.00 63 change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
<spontaneous>
[8] 0.1 0.01 0.00 for_open [8]
-----------------------------------------------
<spontaneous>
[9] 0.1 0.01 0.00 __intel_ssse3_rep_memcpy [9]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[10] 0.0 0.00 0.00 1 change_vars_mp_build_grid_ [10]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[11] 0.0 0.00 0.00 1 fileio_mp_read_old_restart_ [11]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[12] 0.0 0.00 0.00 1 fileio_mp_write_new_restart_ [12]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[13] 0.0 0.00 0.00 1 global_variables_mp_allocate_data_ [13]
-----------------------------------------------
请注意,regroup_all
多次调用 regroup_vars
,redistribute_all
多次调用 redistribute_vars
和 fill_last_blocks
。
我正在使用 ifort
和 -openmp -O2 -pg
选项编译我的代码。
问题:
为什么 gprof 看不到我的文件 i/o 子例程占用的时间? (read_old_restart
, write_new_restart
)
gprof
具体不包括I/O时间。它只尝试测量 CPU 时间。
那是因为它只做两件事:1)在1/100秒的时钟上对程序计数器进行采样,在I/O期间程序计数器没有意义,以及2)计算任何函数的次数B 被任何函数 A 调用。
根据 call-counts,它会尝试猜测每个函数的 CPU 时间中有多少可以归因于每个调用者。
这就是它比 pre-existing 分析器的全部进步。
当你使用gprof
时,你应该明白它的作用和what it doesn't do。
我正在使用与 OpenMP 并行化的相对简单的代码来熟悉 gprof。
我的代码主要包括从输入文件收集数据、执行一些数组操作以及将新数据写入不同的输出文件。我调用了内部子例程 CPU_TIME
以查看 gprof 是否准确:
PROGRAM main
USE global_variables
USE fileio, ONLY: read_old_restart, write_new_restart, output_slice, write_solution
USE change_vars
IMPLICIT NONE
REAL(dp) :: t0, t1
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL allocate_data
CALL CPU_TIME(t1)
PRINT*, "Allocate data =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL build_grid
CALL CPU_TIME(t1)
PRINT*, "Build grid =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL read_old_restart
CALL CPU_TIME(t1)
PRINT*, "Read restart =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL regroup_all
CALL CPU_TIME(t1)
PRINT*, "Regroup all =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL redistribute_all
CALL CPU_TIME(t1)
PRINT*, "Redistribute =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL write_new_restart
CALL CPU_TIME(t1)
PRINT*, "Write restart =", t1 - t0
END PROGRAM main
这是输出:
Allocate data = 1.000000000000000E-003
Build grid = 0.000000000000000E+000
Read restart = 10.7963590000000
Regroup all = 6.65998700000000
Redistribute = 14.3518180000000
Write restart = 53.5218640000000
因此,write_new_restart
子程序是最耗时的,大约占总运行时间的62%。然而根据grof,redistribute_all
多次调用的子例程redistribute_vars
是最耗时的,占总时间的70%。这是 gprof 的输出:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
74.40 8.95 8.95 61 0.15 0.15 change_vars_mp_redistribute_vars_
19.12 11.25 2.30 60 0.04 0.04 change_vars_mp_regroup_vars_
6.23 12.00 0.75 63 0.01 0.01 change_vars_mp_fill_last_blocks_
0.08 12.01 0.01 1 0.01 2.31 change_vars_mp_regroup_all_
0.08 12.02 0.01 __intel_ssse3_rep_memcpy
0.08 12.03 0.01 for_open
0.00 12.03 0.00 1 0.00 12.01 MAIN__
0.00 12.03 0.00 1 0.00 0.00 change_vars_mp_build_grid_
0.00 12.03 0.00 1 0.00 9.70 change_vars_mp_redistribute_all_
0.00 12.03 0.00 1 0.00 0.00 fileio_mp_read_old_restart_
0.00 12.03 0.00 1 0.00 0.00 fileio_mp_write_new_restart_
0.00 12.03 0.00 1 0.00 0.00 global_variables_mp_allocate_data_
index % time self children called name
0.00 12.01 1/1 main [2]
[1] 99.8 0.00 12.01 1 MAIN__ [1]
0.00 9.70 1/1 change_vars_mp_redistribute_all_ [3]
0.01 2.30 1/1 change_vars_mp_regroup_all_ [5]
0.00 0.00 1/1 global_variables_mp_allocate_data_ [13]
0.00 0.00 1/1 change_vars_mp_build_grid_ [10]
0.00 0.00 1/1 fileio_mp_read_old_restart_ [11]
0.00 0.00 1/1 fileio_mp_write_new_restart_ [12]
-----------------------------------------------
<spontaneous>
[2] 99.8 0.00 12.01 main [2]
0.00 12.01 1/1 MAIN__ [1]
-----------------------------------------------
0.00 9.70 1/1 MAIN__ [1]
[3] 80.6 0.00 9.70 1 change_vars_mp_redistribute_all_ [3]
8.95 0.00 61/61 change_vars_mp_redistribute_vars_ [4]
0.75 0.00 63/63 change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
8.95 0.00 61/61 change_vars_mp_redistribute_all_ [3]
[4] 74.4 8.95 0.00 61 change_vars_mp_redistribute_vars_ [4]
-----------------------------------------------
0.01 2.30 1/1 MAIN__ [1]
[5] 19.2 0.01 2.30 1 change_vars_mp_regroup_all_ [5]
2.30 0.00 60/60 change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
2.30 0.00 60/60 change_vars_mp_regroup_all_ [5]
[6] 19.1 2.30 0.00 60 change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
0.75 0.00 63/63 change_vars_mp_redistribute_all_ [3]
[7] 6.2 0.75 0.00 63 change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
<spontaneous>
[8] 0.1 0.01 0.00 for_open [8]
-----------------------------------------------
<spontaneous>
[9] 0.1 0.01 0.00 __intel_ssse3_rep_memcpy [9]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[10] 0.0 0.00 0.00 1 change_vars_mp_build_grid_ [10]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[11] 0.0 0.00 0.00 1 fileio_mp_read_old_restart_ [11]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[12] 0.0 0.00 0.00 1 fileio_mp_write_new_restart_ [12]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[13] 0.0 0.00 0.00 1 global_variables_mp_allocate_data_ [13]
-----------------------------------------------
请注意,regroup_all
多次调用 regroup_vars
,redistribute_all
多次调用 redistribute_vars
和 fill_last_blocks
。
我正在使用 ifort
和 -openmp -O2 -pg
选项编译我的代码。
问题:
为什么 gprof 看不到我的文件 i/o 子例程占用的时间? (read_old_restart
, write_new_restart
)
gprof
具体不包括I/O时间。它只尝试测量 CPU 时间。
那是因为它只做两件事:1)在1/100秒的时钟上对程序计数器进行采样,在I/O期间程序计数器没有意义,以及2)计算任何函数的次数B 被任何函数 A 调用。
根据 call-counts,它会尝试猜测每个函数的 CPU 时间中有多少可以归因于每个调用者。 这就是它比 pre-existing 分析器的全部进步。
当你使用gprof
时,你应该明白它的作用和what it doesn't do。