挂在 google/protobuf/pyext/_message.so 出口处

hang in google/protobuf/pyext/_message.so at exit

这是通过 pip 安装的 TensorFlow 1.0.1。 它通过嵌入式 CPython (libpython) 运行。

有时(可能是我运行的 30%)它挂在 Py_Finalize(),我看到这个回溯:

/work/asr2/zeyer/sprint-executables/20160902.235443.fad8965.linux-x86_64-standard/Flf/flf-tool.linux-intel-standard(_ZN17AssertionsPrivate15safe_stackTraceEi+0x21)[0xc5b891]
/work/asr2/zeyer/sprint-executables/20160902.235443.fad8965.linux-x86_64-standard/Flf/flf-tool.linux-intel-standard[0xc5b8ef]
/u/zeyer/tools/glibc217/libpthread.so.0(+0x113d0)[0x2b6d89bad3d0]
/u/zeyer/tools/glibc217/libpthread.so.0(raise+0x29)[0x2b6d89bad2a9]
/u/zeyer/py-envs/py2-ubuntu16/local/lib/python2.7/site-packages/faulthandler.so(+0x3198)[0x2b6dc2372198]
/u/zeyer/tools/glibc217/libpthread.so.0(+0x113d0)[0x2b6d89bad3d0]
/u/zeyer/py-envs/py2-ubuntu16/local/lib/python2.7/site-packages/google/protobuf/pyext/_message.so(+0xaa943)[0x2b6dc14f0943]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x160f6b)[0x2b6d8b23af6b]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0xc8f0e)[0x2b6d8b1a2f0e]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x15d747)[0x2b6d8b237747]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyDict_SetItem+0x7b)[0x2b6d8b23becb]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(_PyModule_Clear+0xb5)[0x2b6d8b278565]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyImport_Cleanup+0x437)[0x2b6d8b2280e7]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(Py_Finalize+0xfe)[0x2b6d8b1fed9e]
/work/asr2/zeyer/sprint-executables/20160902.235443.fad8965.linux-x86_64-standard/Flf/flf-tool.linux-intel-standard(_ZN6Python11Initializer19AtExitUninitHandlerEv+0x2e)[0xff80de]
/u/zeyer/tools/glibc217/libc.so.6(+0x39fe8)[0x2b6d8bc39fe8]
/u/zeyer/tools/glibc217/libc.so.6(+0x3a035)[0x2b6d8bc3a035]
/u/zeyer/tools/glibc217/libc.so.6(__libc_start_main+0xf7)[0x2b6d8bc20837]
/work/asr2/zeyer/sprint-executables/20160902.235443.fad8965.linux-x86_64-standard/Flf/flf-tool.linux-intel-standard[0x7d6991]

或使用 GDB:

(gdb) bt full
#0  0x00002b6dc14f0943 in std::tr1::_Hashtable<google::protobuf::DescriptorPool const*, std::pair<google::protobuf::DescriptorPool const* const, google::protobuf::python::PyDescriptorPool*>, std::allocator<std::pair<google::protobuf::DescriptorPool const* const, google::protobuf::python::PyDescriptorPool*> >, std::_Select1st<std::pair<google::protobuf::DescriptorPool const* const, google::protobuf::python::PyDescriptorPool*> >, std::equal_to<google::protobuf::DescriptorPool const*>, google::protobuf::hash<google::protobuf::DescriptorPool const*>, std::tr1::__detail::_Mod_range_hashing, std::tr1::__detail::_Default_ranged_hash, std::tr1::__detail::_Prime_rehash_policy, false, false, true>::erase (
    __k=@0x7ffd1bbea740: 0x8269780, this=0x2b6dc1826e40 <google::protobuf::python::descriptor_pool_map>)
    at /opt/rh/devtoolset-2/root/usr/include/c++/4.8.2/tr1/hashtable.h:1041
        __slot = <optimized out>
        __saved_slot = <optimized out>
        __code = 136746880
        __n = 0
        __result = 0
#1  google::protobuf::python::cdescriptor_pool::Dealloc (self=0x2b6dc0d86880)
    at google/protobuf/pyext/descriptor_pool.cc:152
No locals.
#2  0x00002b6d8b23af6b in ?? () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#3  0x00002b6d8b1a2f0e in ?? () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#4  0x00002b6d8b237747 in ?? () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#5  0x00002b6d8b23becb in PyDict_SetItem () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#6  0x00002b6d8b278565 in _PyModule_Clear () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#7  0x00002b6d8b2280e7 in PyImport_Cleanup () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#8  0x00002b6d8b1fed9e in Py_Finalize () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#9  0x0000000000ff80de in Python::Initializer::AtExitUninitHandler() ()
No symbol table info available.
#10 0x00002b6d8bc39fe8 in ?? () from /u/zeyer/tools/glibc217/libc.so.6
No symbol table info available.
#11 0x00002b6d8bc3a035 in exit () from /u/zeyer/tools/glibc217/libc.so.6
No symbol table info available.
#12 0x00002b6d8bc20837 in __libc_start_main () from /u/zeyer/tools/glibc217/libc.so.6
No symbol table info available.
#13 0x00000000007d6991 in _start ()
No symbol table info available.

即它发生在 _PyModule_Clear 中,然后在 google/protobuf/pyext/_message.so 中,这就是为什么我认为这与 TF 相关。

在没有挂起的情况下,我看到这个输出:

Exception AttributeError: AttributeError("'NoneType' object has no attribute 'raise_exception_on_not_ok_status'",) in <bound method Session.__del__ of <tensorflow.python.client.session.Session object at 0x2afd625b12d0>> ignored

我也 asked upstream on TF 但他们建议 post 在这里。

知道它为什么会挂起以及如何解决这个问题吗?

请注意,此崩溃是通过 std::atexit 在回调中发生的。我想问题是在我从我的 atexit-handler 调用 Py_Finalize 之前清理了 Google 或 std 的一些东西,这导致了这次崩溃。不过我认为这不应该发生。

无论如何,我现在不使用 std::atexit 而是使用我自己的退出处理程序逻辑来解决这个问题(但是,如果我在任何地方直接使用 exit(),这将不起作用)。