malloc issues when parallelizing ssh calls with threads
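I am trying to open multiple ssh connections (using libssh 0.7.5) and parallelize them with boost::threads.
After parallelizing, I observed that out of 20 runs of my executable, 3 failed with a glibc "double free or corruption" error and 2 failed with a segmentation fault.
These errors do not occur when the ssh connection calls are made serially.
Using gdb, I obtained the following backtrace for the failure: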
#0 0x00007ffff49735e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007ffff4974dc5 in abort () at abort.c:92
#2 0x00007ffff49b14f7 in __libc_message (do_abort=2, fmt=0x7ffff4a99a60 "*** glibc detected *** %s: %s: 0x%s ***\n")
at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3 0x00007ffff49b6f3e in malloc_printerr (action=3, str=0x7ffff4a99df0 "double free or corruption (!prev)",
ptr=<value optimized out>, ar_ptr=<value optimized out>) at malloc.c:6360
#4 0x00007ffff49b9dd0 in _int_free (av=0x7fffe0000020, p=0x7fffe0001810, have_lock=1) at malloc.c:4846
#5 0x00007ffff49bcd60 in _int_realloc (av=0x7fffe0000020, oldp=0x7fffe0001810, oldsize=<value optimized out>, nb=272)
at malloc.c:5398
#6 0x00007ffff49bd058 in __libc_realloc (oldmem=0x7fffe0001820, bytes=256) at malloc.c:3833
#7 0x00007ffff6f06ccf in CRYPTO_realloc () from /usr/lib64/libcrypto.so.10
#8 0x00007ffff6f822be in lh_insert () from /usr/lib64/libcrypto.so.10
#9 0x00007ffff6f09d9b in OBJ_NAME_add () from /usr/lib64/libcrypto.so.10
#10 0x00007ffff6f914e7 in OpenSSL_add_all_ciphers () from /usr/lib64/libcrypto.so.10
#11 0x00007ffff6f911ae in OPENSSL_add_all_algorithms_noconf () from /usr/lib64/libcrypto.so.10
#12 0x00007ffff7b69c8a in ssh_crypto_init () from /home/utils/libssh-0.7.5/lib/libssh.so.4
#13 0x00007ffff7b6ae05 in ssh_init () from /home/utils/libssh-0.7.5/lib/libssh.so.4
#14 0x00007ffff7b65799 in ssh_connect () from /home/utils/libssh-0.7.5/lib/libssh.so.4
#15 0x000000000043f766 in SSHConnector::SSHConnector(std::basic_string<char, std::char_traits<char>, std::allocator<char> >) () at SSHConnector.cpp:44
#16 0x0000000000433855 in connect_host(connect_host_report*) () at Util.cpp:339
#17 0x000000000043f57d in void boost::_bi::list1<boost::_bi::value<connect_host_report*> >::operator()<void (*)(connect_host_report*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(connect_host_report*), boost::_bi::list0&, int)
() at /home/utils/boost-1.55.0//include/boost/bind/bind.hpp:253
#18 0x000000000043f0f9 in boost::_bi::bind_t<void, void (*)(connect_host_report*), boost::_bi::list1<boost::_bi::value<connect_host_report*> > >::operator()() () at /home/utils/boost-1.55.0//include/boost/bind/bind_template.hpp:20
#19 0x000000000043e738 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(connect_host_report*), boost::_bi::list1<boost::_bi::value<connect_host_report*> > > >::run() ()
at /home/utils/boost-1.55.0//include/boost/thread/detail/thread.hpp:117
#20 0x00007ffff60995da in thread_proxy () from /home/utils/boost-1.55.0//lib/libboost_thread.so.1.55.0
#21 0x00007ffff6c86aa1 in start_thread (arg=0x7ffff1593700) at pthread_create.c:301
#22 0x00007ffff4a29aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Here is the relevant code from SSHConnector as well (the ssh_connect call on line 44 is where the error originates):
28 string error_info;
29
30 ssh_session_object = ssh_new();
31 if (ssh_session_object == NULL) {
32 error_info = "Could not create SSH session";
33 throw ConnectionUnsuccessfulException(hostname, error_info,
34 CLEANUP_NOT_REQUIRED);
35 }
36
37 ssh_options_set(ssh_session_object, SSH_OPTIONS_HOST, hostname.c_str());
38 ssh_options_set(ssh_session_object, SSH_OPTIONS_LOG_VERBOSITY,
39 &verbosity);
40 ssh_options_set(ssh_session_object, SSH_OPTIONS_TIMEOUT,
41 &ssh_connection_timeout);
42
43 int rc;
44 rc = ssh_connect(ssh_session_object);
I also tested with valgrind:
==641==
==641== HEAP SUMMARY:
==641== in use at exit: 16,104 bytes in 540 blocks
==641== total heap usage: 4,791 allocs, 4,251 frees, 559,189 bytes allocated
==641==
==641== Searching for pointers to 540 not-freed blocks
==641== Checked 1,006,816 bytes
==641==
==641== LEAK SUMMARY:
==641== definitely lost: 0 bytes in 0 blocks
==641== indirectly lost: 0 bytes in 0 blocks
==641== possibly lost: 0 bytes in 0 blocks
==641== still reachable: 16,104 bytes in 540 blocks
==641== suppressed: 0 bytes in 0 blocks
==641== Rerun with --leak-check=full to see details of leaked memory
==641==
==641== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)
--641--
--641-- used_suppression: 4 U1004-ARM-_dl_relocate_object
--641-- used_suppression: 2 glibc-2.5.x-on-SUSE-10.2-(PPC)-2a
==641==
==641== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)
Can anyone suggest what might cause this error to occur at random?
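OpenSSL versions prior to 1.1 are not thread safe unless the application registers its own locking callbacks. libssh provides a set of default locking callbacks, but its initialization sequence is itself not thread safe (in version 0.7). The backtrace shows ssh_connect calling ssh_init lazily, so several threads can end up inside OpenSSL_add_all_ciphers at the same time, all modifying the same global tables. You therefore have to call ssh_init or ssh_threads_init explicitly, once, before starting any threads, to avoid the race condition.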
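To illustrate the ordering, here is a minimal sketch of "initialize once on the main thread, then fan out". It deliberately does not link against libssh: library_init and connect_host are hypothetical stand-ins for ssh_init and for the per-host work in SSHConnector, and std::thread is used in place of boost::thread only to keep the example self-contained. Treat it as an outline of the pattern, not as the actual fix for your code base.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for libssh's ssh_init(): it mutates global state
// without any locking, so it must not run on several threads at once.
static int g_init_calls = 0;
static bool g_initialized = false;

void library_init() {
    ++g_init_calls;
    g_initialized = true;
}

// Hypothetical stand-in for the per-host work (ssh_new, ssh_options_set,
// ssh_connect). It relies on initialization having happened already and
// never triggers it itself.
void connect_host(const std::string& hostname, std::atomic<int>& connected) {
    assert(g_initialized);
    if (!hostname.empty()) {
        ++connected;
    }
}

// Initializes once on the calling thread, then starts one thread per host.
// Assumes connect_all itself is only ever called from a single thread.
int connect_all(const std::vector<std::string>& hosts) {
    if (!g_initialized) {
        library_init();  // real program: ssh_init() here, before any thread exists
    }
    std::atomic<int> connected(0);
    std::vector<std::thread> workers;
    for (const std::string& host : hosts) {
        workers.emplace_back(connect_host, std::cref(host), std::ref(connected));
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
    return connected.load();
}
```

In the real program the equivalent change would be a single ssh_init() call at the top of main, before the first boost::thread is created, so that the lazy ssh_init inside ssh_connect no longer has anything to initialize.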