Signal handler hangs in Popen.wait(timeout) if an infinite wait() was already started

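I've come across a Python subprocess issue that I have reproduced on Python 3.6 and 3.7 but do not understand. I have a program, call it Main, that launches an external process, call it "Slave", using subprocess.Popen(). Main registers a SIGTERM signal handler. Main waits for the Slave process to complete using either proc.wait(None) or proc.wait(timeout). The Slave process can be interrupted by sending a SIGTERM signal to Main. The SIGTERM handler sends a SIGINT signal to the Slave and calls wait(30) for it to terminate. If Main is using wait(None), the handler's wait(30) waits the full 30 seconds even though the Slave process has already terminated. If Main is using the wait(timeout) version, the handler's wait(30) returns as soon as the Slave terminates.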

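Here is a small test application that demonstrates the issue. Run it via python wait_test.py to use the non-timeout wait(None). Run it via python wait_test.py <timeout value> to give the Main wait a specific timeout.

Once the program is running, execute kill -15 <pid> and see how the application reacts.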

#
# Save this to a file called wait_test.py
#
import signal
import subprocess
import sys
from datetime import datetime

slave_proc = None


def sigterm_handler(signum, stack):
    print("Process received SIGTERM signal {} while processing job!".format(signum))
    print("slave_proc is {}".format(slave_proc))

    if slave_proc is not None:
        try:
            print("{}: Sending SIGINT to slave.".format(datetime.now()))
            slave_proc.send_signal(signal.SIGINT)
            slave_proc.wait(30)
            print("{}: Handler wait completed.".format(datetime.now()))
        except subprocess.TimeoutExpired:
            slave_proc.terminate()
        except Exception as exception:
            print('Sigterm Exception: {}'.format(exception))
            slave_proc.terminate()
            slave_proc.send_signal(signal.SIGKILL)


def main(wait_val=None):
    with open("stdout.txt", 'w+') as stdout:
        with open("stderr.txt", 'w+') as stderr:
            proc = subprocess.Popen(["python", "wait_test.py", "slave"],
                                    stdout=stdout,
                                    stderr=stderr,
                                    universal_newlines=True)

    print('Slave Started')

    global slave_proc
    slave_proc = proc

    try:
        proc.wait(wait_val)    # If this is a no-timeout wait, ie: wait(None), then will hang in sigterm_handler.
        print('Slave Finished by itself.')
    except subprocess.TimeoutExpired as te:
        print(te)
        print('Slave finished by timeout')
        proc.send_signal(signal.SIGINT)
        proc.wait()

    print("Job completed")


if __name__ == '__main__':
    if len(sys.argv) > 1 and sys.argv[1] == 'slave':
        while True:
            pass

    signal.signal(signal.SIGTERM, sigterm_handler)
    main(int(sys.argv[1]) if len(sys.argv) > 1 else None)
    print("{}: Exiting main.".format(datetime.now()))

Here are examples of the two runs:

Note here the 30 second delay
--------------------------------
[mkurtz@localhost testing]$ python wait_test.py
Slave Started
Process received SIGTERM signal 15 while processing job!
slave_proc is <subprocess.Popen object at 0x7f79b50e8d90>
2022-03-30 11:08:15.526319: Sending SIGINT to slave.   <--- 11:08:15
Slave Finished by itself.
Job completed
2022-03-30 11:08:45.526942: Exiting main.              <--- 11:08:45


Note here the instantaneous shutdown
-------------------------------------
[mkurtz@localhost testing]$ python wait_test.py 100
Slave Started
Process received SIGTERM signal 15 while processing job!
slave_proc is <subprocess.Popen object at 0x7fa2412a2dd0>
2022-03-30 11:10:03.649931: Sending SIGINT to slave.   <--- 11:10:03.649
2022-03-30 11:10:03.653170: Handler wait completed.    <--- 11:10:03.653
Slave Finished by itself.
Job completed
2022-03-30 11:10:03.673234: Exiting main.              <--- 11:10:03.673

These specific tests were run on CentOS 7 using Python 3.7.9. Can someone explain this behavior?

The Popen class has an internal lock for wait operations:

        # Held while anything is calling waitpid before returncode has been
        # updated to prevent clobbering returncode if wait() or poll() are
        # called from multiple threads at once.  After acquiring the lock,
        # code must re-check self.returncode to see if another thread just
        # finished a waitpid() call.
        self._waitpid_lock = threading.Lock()

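The main difference between wait() and wait(timeout=...) is that the former holds the lock while waiting indefinitely, whereas the latter is a busy loop that acquires and releases the lock on each iteration.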

            if timeout is not None:
                ...
                while True:
                    if self._waitpid_lock.acquire(False):
                        try:
                            ...
                            # wait without any delay
                            (pid, sts) = self._try_wait(os.WNOHANG)
                            ...
                        finally:
                            self._waitpid_lock.release()
                    ...
                    time.sleep(delay)
            else:
                while self.returncode is None:
                    with self._waitpid_lock:  # acquire lock unconditionally
                        ...
                        # wait indefinitely
                        (pid, sts) = self._try_wait(0)
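The effect of the polling branch can be reproduced without any subprocess. The following is a minimal sketch, not the actual Popen code: timed_wait is a hypothetical helper that mimics the non-blocking acquire loop with a plain threading.Lock and returns False where the real code would raise TimeoutExpired.

```python
import threading
import time


def timed_wait(lock, timeout, delay=0.01):
    """Mimic the wait(timeout) loop: poll the lock without ever blocking on it."""
    deadline = time.monotonic() + timeout
    while True:
        if lock.acquire(False):  # non-blocking attempt, as in the excerpt above
            try:
                return True  # the real code would call waitpid(WNOHANG) here
            finally:
                lock.release()
        if time.monotonic() >= deadline:
            return False  # the real code raises TimeoutExpired here
        time.sleep(delay)


waitpid_lock = threading.Lock()

# Nobody holds the lock: the polling wait gets through immediately.
free_result = timed_wait(waitpid_lock, timeout=0.2)

# The lock is already held by the caller, like a wait() with no timeout
# that is still in progress: the polling wait can only run into its timeout.
with waitpid_lock:
    held_result = timed_wait(waitpid_lock, timeout=0.2)

print(free_result, held_result)
```

The second call never gets the lock because the holder cannot make progress while timed_wait is running in the same thread.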

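This is not a problem for regular concurrent code (i.e. threading), since the thread that is running wait() and holding the lock is woken up as soon as the subprocess finishes. This in turn allows all other threads waiting on the lock/subprocess to proceed promptly.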


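However, things look different when a) the main thread holds the lock in wait() and b) a signal handler attempts to wait. A subtlety of signal handlers is that they interrupt the main thread: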

signal: Signals and Threads

Python signal handlers are always executed in the main Python thread of the main interpreter, even if the signal was received in another thread. […]

Since the signal handler runs in the main thread, the main thread's regular code execution is paused until the signal handler finishes!
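The quoted rule can be checked with a small experiment. This is only a sketch, assuming a Unix system where SIGUSR1 is available and that the code runs in the main thread; the names seen and record_thread are made up for illustration. A worker thread sends the signal, yet the handler records the main thread's name.

```python
import os
import signal
import threading
import time

seen = []  # names of the threads in which the handler actually ran


def record_thread(signum, frame):
    # Record which thread executes the Python-level handler.
    seen.append(threading.current_thread().name)


# signal.signal() itself may only be called from the main thread.
previous = signal.signal(signal.SIGUSR1, record_thread)
try:
    # Send the signal from a worker thread instead of the main thread.
    worker = threading.Thread(
        target=os.kill, args=(os.getpid(), signal.SIGUSR1), name="worker"
    )
    worker.start()
    worker.join()

    # Give the interpreter a moment to run the pending handler.
    deadline = time.monotonic() + 5
    while not seen and time.monotonic() < deadline:
        time.sleep(0.01)
finally:
    signal.signal(signal.SIGUSR1, previous)

print(seen)
```

The list should contain only the main thread's name, never "worker".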

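By running wait in the signal handler, a) the signal handler blocks waiting for the lock and b) the lock blocks waiting for the signal handler. Only once the signal handler's wait times out does the "main thread" resume, receive confirmation that the subprocess finished, set the return code and release the lock.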
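One way to sidestep this, sketched here under the assumption that the handler does not need to wait itself, is to let the handler only forward the signal and leave all waiting to the main code path. The function name run_and_forward is hypothetical, and the escalation to terminate() from the question's handler is left out.

```python
import signal
import subprocess


def run_and_forward(cmd):
    """Run cmd; on SIGTERM, forward SIGINT to the child and keep waiting."""
    proc = subprocess.Popen(cmd)

    def on_sigterm(signum, frame):
        # Forward only. No proc.wait() here: the interrupted wait() in the
        # main code path below still holds Popen's internal lock right now.
        proc.send_signal(signal.SIGINT)

    previous = signal.signal(signal.SIGTERM, on_sigterm)
    try:
        # The handler returns quickly, after which this wait() resumes and
        # returns as soon as the child has exited.
        return proc.wait()
    finally:
        signal.signal(signal.SIGTERM, previous)
```

If the child might ignore SIGINT, a grace period followed by terminate() would have to be added in the main code path, for example by looping on proc.wait(timeout=...) as in the timeout variant from the question.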