Using valgrind to spot errors in MPI code
I have a code that runs perfectly in serial, but running it with mpirun -n 2 ./out gives the following error:
*** Error in `./out': malloc(): smallbin double linked list corrupted: 0x00000000024aa090 ***
I tried using valgrind like this:
valgrind --leak-check=yes mpirun -n 2 ./out
I get the following output:
==3494== Memcheck, a memory error detector
==3494== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==3494== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==3494== Command: mpirun -n 2 ./out
==3494==
Grid_0/NACA0012.msh
Grid_0/NACA0012.msh
>>> Number of cells: 7734
>>> Number of cells: 7734
0.000000 0 1.470622e-02
*** Error in `./out': malloc(): smallbin double linked list corrupted: 0x00000000024aa090 ***
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 3497 RUNNING AT orhan
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
==3494==
==3494== HEAP SUMMARY:
==3494== in use at exit: 131,120 bytes in 2 blocks
==3494== total heap usage: 1,064 allocs, 1,062 frees, 231,859 bytes allocated
==3494==
==3494== LEAK SUMMARY:
==3494== definitely lost: 0 bytes in 0 blocks
==3494== indirectly lost: 0 bytes in 0 blocks
==3494== possibly lost: 0 bytes in 0 blocks
==3494== still reachable: 131,120 bytes in 2 blocks
==3494== suppressed: 0 bytes in 0 blocks
==3494== Reachable blocks (those to which a pointer was found) are not shown.
==3494== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3494==
==3494== For counts of detected and suppressed errors, rerun with: -v
==3494== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
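I am not good with valgrind, but as far as I can tell valgrind reports no problems. Are there better valgrind options for finding the source of this specific error?
Note that with the call above,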
valgrind --leak-check=yes mpirun -n 2 ./out
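you are running valgrind on the program mpirun, which has presumably been extensively tested and works correctly, not on ./out, the program you know has a problem. By default valgrind does not trace child processes, so the ranks that mpirun spawns are never checked, which is why the summary reports 0 errors.
To run valgrind on your test program instead, use: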
mpirun -n 2 valgrind --leak-check=yes ./out
This uses mpirun to launch 2 processes, each running valgrind --leak-check=yes ./out.
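The ordering is the whole point, so it can help to generate the command instead of typing it by hand. Below is a minimal sketch, assuming bash; the helper name vg_mpi_cmd is hypothetical (it is not part of MPI or valgrind), and it only prints the command so you can inspect it before running it.

```shell
#!/usr/bin/env bash
# vg_mpi_cmd is a hypothetical helper for this sketch, not part of MPI or valgrind.
# It prints the launch command with valgrind placed AFTER the launcher, so that
# each MPI rank runs under its own valgrind instance.
vg_mpi_cmd() {
    local np="$1"   # number of MPI processes
    shift           # remaining arguments: the program under test and its arguments
    echo "mpirun -n ${np} valgrind --leak-check=yes $*"
}

# Example: vg_mpi_cmd 2 ./out
# prints:  mpirun -n 2 valgrind --leak-check=yes ./out
```

Once the printed line looks right, it can be pasted into the terminal or run with eval "$(vg_mpi_cmd 2 ./out)".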
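You can never go wrong with a Jonathan Dursi answer, but let me add that reading valgrind output from multiple processes can be a pain.
Instead of writing to the console, dump the output to a log file. Of course, if both processes write to the same log file, that will not help. Log to multiple files instead: valgrind interprets '%p' as the process ID, so you get two (or more) log files: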
mpiexec -np 2 valgrind --leak-check=full \
--show-reachable=yes --log-file=nc.vg.%p ./noncontig_coll2 -fname blah
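With one log per process, a quick first pass is to pull the final summary line out of each file and see which ranks reported errors. This is a rough sketch, assuming bash and the nc.vg.%p naming used above; summarize_vg_logs is a hypothetical helper name, and it relies only on the "ERROR SUMMARY" line that Memcheck writes at the end of its output.

```shell
#!/usr/bin/env bash
# summarize_vg_logs is a hypothetical helper for this sketch.
# It prints the ERROR SUMMARY line of every per-process valgrind log,
# prefixed with the file name, so the ranks with errors stand out.
summarize_vg_logs() {
    local f
    for f in "$@"; do
        # Skip arguments that are not files (for example, a glob that matched nothing).
        [ -f "$f" ] || continue
        # Memcheck ends its output with a line such as
        # "==3494== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)".
        grep -H 'ERROR SUMMARY' "$f" || echo "$f: no ERROR SUMMARY line found"
    done
}

# Example: summarize_vg_logs nc.vg.*
```

Any file that shows a non-zero error count is the one to open and read in full.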