Finding the cause of a BrokenProcessPool in python's concurrent.futures
In a nutshell

I get a BrokenProcessPool exception when parallelizing my code with concurrent.futures. No further error is displayed. I want to find the cause of the error and I am asking for ideas of how to do that.

The full problem

I am using concurrent.futures to parallelize some code.
with ProcessPoolExecutor() as pool:
    mapObj = pool.map(myMethod, args)
I end up with (and only with) the following exception:
concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore
Unfortunately, the program is complex and the error only appears after the program has run for 30 minutes. Therefore, I cannot provide a nice minimal example.
In order to find the cause of the issue, I wrapped the method that I run in parallel in a try-except block:
def myMethod(*args):
    try:
        ...
    except Exception as e:
        print(e)
The problem remained the same, and the except block was never entered. I concluded that the exception does not come from my code.
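This conclusion can be illustrated with a hypothetical minimal reproduction (not the original program, whose crash took 30 minutes to appear): a worker that dies from a hard crash such as a segmentation fault bypasses any try/except inside the task, because the interpreter itself is killed before Python-level exception handling can run.

```python
import ctypes
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def crash(_):
    try:
        ctypes.string_at(0)   # dereference a NULL pointer -> SIGSEGV
    except Exception as e:    # never reached: the process dies first
        print("caught:", e)

def pool_breaks_despite_try_except():
    """Return True if the pool broke even though the task wrapped its
    body in try/except, i.e. the crash was uncatchable from Python."""
    try:
        with ProcessPoolExecutor(max_workers=1) as pool:
            list(pool.map(crash, [None]))
    except BrokenProcessPool:
        return True
    return False

if __name__ == "__main__":
    print(pool_breaks_despite_try_except())
```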
My next step was to write a custom ProcessPoolExecutor class that is a child of the original ProcessPoolExecutor and allows me to replace some methods with customized ones. I copied and pasted the original code of the method _process_worker and added some print statements.
def _process_worker(call_queue, result_queue):
    """Evaluates calls from call_queue and places the results in result_queue.
    ...
    """
    while True:
        call_item = call_queue.get(block=True)
        if call_item is None:
            # Wake up queue management thread
            result_queue.put(os.getpid())
            return
        try:
            r = call_item.fn(*call_item.args, **call_item.kwargs)
        except BaseException as e:
            print("??? Exception ???")  # newly added
            print(e)                    # newly added
            exc = _ExceptionWithTraceback(e, e.__traceback__)
            result_queue.put(_ResultItem(call_item.work_id, exception=exc))
        else:
            result_queue.put(_ResultItem(call_item.work_id,
                                         result=r))
Again, the except block was never entered. This was to be expected, because I had already ensured that my code does not raise an exception (and if everything had worked well, the exception would have been passed to the main process anyway).
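For contrast, a sketch (hypothetical names, not the author's code) showing that an ordinary Python exception raised in a worker is caught by _process_worker, shipped back, and re-raised in the parent. This is exactly why a pool that breaks silently points at something below the Python level:

```python
from concurrent.futures import ProcessPoolExecutor

def boom(x):
    # An ordinary exception: _process_worker catches it and sends it back.
    raise ValueError("worker failed on %r" % x)

def parent_sees_worker_exception():
    """Return the message of the exception re-raised in the parent."""
    try:
        with ProcessPoolExecutor(max_workers=1) as pool:
            list(pool.map(boom, [1]))
    except ValueError as e:
        return str(e)
    return None

if __name__ == "__main__":
    print(parent_sees_worker_exception())  # worker failed on 1
```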
Now I am lacking ideas how I could find the error. The exception is raised here:
def submit(self, fn, *args, **kwargs):
    with self._shutdown_lock:
        if self._broken:
            raise BrokenProcessPool('A child process terminated '
                'abruptly, the process pool is not usable anymore')
        if self._shutdown_thread:
            raise RuntimeError('cannot schedule new futures after shutdown')

        f = _base.Future()
        w = _WorkItem(f, fn, args, kwargs)

        self._pending_work_items[self._queue_count] = w
        self._work_ids.put(self._queue_count)
        self._queue_count += 1
        # Wake up queue management thread
        self._result_queue.put(None)

        self._start_queue_management_thread()
        return f
And the process pool is set to be broken here:
def _queue_management_worker(executor_reference,
                             processes,
                             pending_work_items,
                             work_ids_queue,
                             call_queue,
                             result_queue):
    """Manages the communication between this process and the worker processes.
    ...
    """
    executor = None

    def shutting_down():
        return _shutdown or executor is None or executor._shutdown_thread

    def shutdown_worker():
        ...

    reader = result_queue._reader

    while True:
        _add_call_item_to_queue(pending_work_items,
                                work_ids_queue,
                                call_queue)

        sentinels = [p.sentinel for p in processes.values()]
        assert sentinels
        ready = wait([reader] + sentinels)
        if reader in ready:
            result_item = reader.recv()
        else:  # THIS BLOCK IS ENTERED WHEN THE ERROR OCCURS
            # Mark the process pool broken so that submits fail right now.
            executor = executor_reference()
            if executor is not None:
                executor._broken = True
                executor._shutdown_thread = True
                executor = None
            # All futures in flight must be marked failed
            for work_id, work_item in pending_work_items.items():
                work_item.future.set_exception(
                    BrokenProcessPool(
                        "A process in the process pool was "
                        "terminated abruptly while the future was "
                        "running or pending."
                    ))
                # Delete references to object. See issue16284
                del work_item
            pending_work_items.clear()
            # Terminate remaining workers forcibly: the queues or their
            # locks may be in a dirty state and block forever.
            for p in processes.values():
                p.terminate()
            shutdown_worker()
            return
    ...
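The sentinels in this loop are the key: a process's sentinel handle becomes ready in multiprocessing.connection.wait() as soon as the process dies, for any reason, which is how the pool notices an abrupt termination without receiving any message. A standalone sketch of that mechanism (assuming a POSIX system for os.kill):

```python
import os
import signal
import time
from multiprocessing import Process
from multiprocessing.connection import wait

def sleeper():
    time.sleep(60)

if __name__ == "__main__":
    p = Process(target=sleeper)
    p.start()
    os.kill(p.pid, signal.SIGKILL)         # simulate an abrupt worker death
    ready = wait([p.sentinel], timeout=5)  # sentinel fires when p dies
    print(p.sentinel in ready)             # True: the death is observable here
    p.join()
    print(p.exitcode)                      # -9, i.e. killed by SIGKILL
```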
It is (or seems to be) a fact that processes terminate, but I have no clue why. Are my thoughts so far correct? What are possible causes that make a process terminate without a message? (Is that even possible?) Where can I apply further diagnostics? Which questions should I ask myself in order to get closer to a solution?
I am using python 3.5 on 64bit Linux.
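One further diagnostic worth trying at this point (a sketch, not from the original question): the stdlib faulthandler module makes a worker dump a Python traceback to stderr when it receives a fatal signal such as SIGSEGV, pointing at the crashing line. Note the initializer argument used here only exists in Python 3.7+; on 3.5 you can call faulthandler.enable() at the top of the task function instead.

```python
import faulthandler
from concurrent.futures import ProcessPoolExecutor

def enable_faulthandler():
    # If this worker later receives SIGSEGV, SIGABRT, etc., a Python-level
    # traceback is written to stderr before the process dies.
    faulthandler.enable()

def myMethod(x):
    return x * 2  # placeholder for the real work

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2,
                             initializer=enable_faulthandler) as pool:
        print(list(pool.map(myMethod, [1, 2, 3])))  # [2, 4, 6]
```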
I think I was able to get as far as possible:
I changed the _queue_management_worker method in my modified ProcessPoolExecutor module such that the exit code of the failed process is printed:
def _queue_management_worker(executor_reference,
                             processes,
                             pending_work_items,
                             work_ids_queue,
                             call_queue,
                             result_queue):
    """Manages the communication between this process and the worker processes.
    ...
    """
    executor = None

    def shutting_down():
        return _shutdown or executor is None or executor._shutdown_thread

    def shutdown_worker():
        ...

    reader = result_queue._reader

    while True:
        _add_call_item_to_queue(pending_work_items,
                                work_ids_queue,
                                call_queue)

        sentinels = [p.sentinel for p in processes.values()]
        assert sentinels
        ready = wait([reader] + sentinels)
        if reader in ready:
            result_item = reader.recv()
        else:
            # BLOCK INSERTED FOR DIAGNOSIS ONLY ---------
            vals = list(processes.values())
            for s in ready:
                j = sentinels.index(s)
                print("is_alive()", vals[j].is_alive())
                print("exitcode", vals[j].exitcode)
            # -------------------------------------------

            # Mark the process pool broken so that submits fail right now.
            executor = executor_reference()
            if executor is not None:
                executor._broken = True
                executor._shutdown_thread = True
                executor = None
            # All futures in flight must be marked failed
            for work_id, work_item in pending_work_items.items():
                work_item.future.set_exception(
                    BrokenProcessPool(
                        "A process in the process pool was "
                        "terminated abruptly while the future was "
                        "running or pending."
                    ))
                # Delete references to object. See issue16284
                del work_item
            pending_work_items.clear()
            # Terminate remaining workers forcibly: the queues or their
            # locks may be in a dirty state and block forever.
            for p in processes.values():
                p.terminate()
            shutdown_worker()
            return
    ...
Afterwards I looked up the meaning of the exit code:
from multiprocessing.process import _exitcode_to_name
print(_exitcode_to_name[my_exit_code])
where my_exit_code is the exit code that was printed in the block I inserted into _queue_management_worker. In my case the code was -11, which means that I ran into a segmentation fault. Finding the reason for this issue will be a huge task, but it goes beyond the scope of this question.
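As a side note on reading such exit codes without the private _exitcode_to_name mapping: multiprocessing reports a worker killed by a signal as the negative signal number, so the public signal.Signals enum can decode it. A small helper (hypothetical name, POSIX signal numbers assumed):

```python
import signal

def describe_exitcode(code):
    """Translate a multiprocessing exitcode into a readable label.
    A negative value means the process was killed by signal -code."""
    if code is not None and code < 0:
        return signal.Signals(-code).name
    return str(code)

print(describe_exitcode(-11))  # SIGSEGV: the segmentation fault found above
print(describe_exitcode(0))    # 0: a clean exit
```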
If you are on macOS, there is a known issue with how some versions of macOS use forking in a way that is not considered fork-safe by Python in some scenarios. The workaround that worked for me is to use the no_proxy environment variable.
Edit ~/.bash_profile and include the following (it might be better to specify a list of domains or subnets here instead of *):
no_proxy='*'
Refresh the current context:
source ~/.bash_profile
The local versions where I found and fixed this issue: Python 3.6.0 on macOS 10.14.1 and 10.13.x