Python: what does Thread.is_alive *exactly* mean?

In Python 3.9.10, I'm running into the following very disturbing behavior:

class MyThread(threading.Thread):
    def run(self):
        liveness = self.is_alive()
        logging.debug(f"Am I alive? {liveness}")  # prints FALSE!!!
        ...  # do some work involving asyncio and networking
        ...  # (specifically, I'm using aiohttp) and I know this work is
        ...  # actually being done because I can see its side-effects
        ...  # from across the network.
        liveness = self.is_alive()
        logging.debug(f"Am I alive? {liveness}")  # prints False AGAIN!!!
        ...  # go on with that work (still detectable and detected)
        liveness = self.is_alive()
        logging.debug(f"Am I alive? {liveness}")  # still False...

In some cases, the calls to is_alive() are returning False. Now, I'm not doing anything weird, like redefining methods in MyThread that I shouldn't, or messing with the internals of anything.

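My question is: under normal circumstances, is there any situation in which Thread.is_alive returns False after the thread has been started, while the thread is still doing work? (Mostly Python work, by the way, not some C code running in the background.)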

More details

There is a main thread and two auxiliary threads. It goes something like this. The following code runs in the main thread:

exit_signal = threading.Event()
workers = {}
# pass them the exit_signal so they know when to stop:
workers['connect_to_server_1'] = MyThread("server1.com", exit_signal)
workers['connect_to_server_1'].start()
workers['connect_to_server_2'] = MyThread("server2.com", exit_signal)
workers['connect_to_server_2'].start()

# wait until the process gets a SIGINT (user hits ^C)
try:
    for w in workers.values():
        w.join()  # will never return
except KeyboardInterrupt:
    logging.info("ok, user wants to quit, let's quit")
else:
    logging.critical("threads have quit on their own")  # never happens

# list thread statuses
w = workers['connect_to_server_1']
logging.debug(f"Is {w} alive? {w.is_alive()}")  # prints FALSE
w = workers['connect_to_server_2']
logging.debug(f"Is {w} alive? {w.is_alive()}")  # prints TRUE

# For debugging purposes, give the workers some more time to keep doing
# their jobs. This here is an interesting time window: the main thread
# has already received ^C, but the workers are supposedly not aware of
# that.
time.sleep(10)
# Finally, tell workers to stop, and wait for them to go:
exit_signal.set()
workers['connect_to_server_1'].join()
workers['connect_to_server_2'].join()
logging.info("all good, bye!")

What is going on here?

Closing remarks

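This code had been running on Python 3.6 for months, usually with about 15 workers rather than 2, and this problem never occurred. It only happens when I try to run it on Python 3.9. It is fairly easy to reproduce: when I run the service with Python 3.9, everything works fine about half the time, but the other half of the time I get freaked out by this zombie thread telling me that it is dead while it is talking to me.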

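Also, the zombie thread is always the one talking to one particular server, which makes me think this could be an issue with that server's SSL certificate or its implementation of the websocket protocol, but in any case I don't control that server. What I do control is this instance of threading.Thread, which should be either dead or walking upright, but not both at the same time.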

What am I missing here?

It turns out this is a recently introduced bug in the threading implementation. Thread.join calls the internal method Thread._wait_for_tstate_lock, and that method was recently changed to look like this:

try:
    if lock.acquire(block, timeout):
        lock.release()
        self._stop()
except:
    if lock.locked():
        # bpo-45274: lock.acquire() acquired the lock, but the function
        # was interrupted with an exception before reaching the
        # lock.release(). It can happen if a signal handler raises an
        # exception, like CTRL+C which raises KeyboardInterrupt.
        lock.release()
        self._stop()
    raise

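The if lock.locked() check is an attempt to fix a hang that occurs if this method is interrupted by Ctrl-C right after lock.acquire and before lock.release, but the check is wrong. It does not check whether the preceding lock.acquire call actually acquired the lock. It just checks whether the lock is locked at all! The lock is almost always locked; in particular, it is supposed to stay locked for the entire time the thread is alive.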
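The difference between "the lock is locked" and "this call acquired the lock" can be seen in a small standalone sketch. It is not taken from CPython; the names demo, holder, held, and finish are made up for illustration. A helper thread holds a plain threading.Lock while the main thread fails to acquire it, yet lock.locked() still reports True, and lock.release() succeeds even though the main thread never owned the lock:

```python
import threading


def demo():
    lock = threading.Lock()
    held = threading.Event()    # set once the helper thread owns the lock
    finish = threading.Event()  # tells the helper thread to exit

    def holder():
        lock.acquire()          # the helper thread is the one holding the lock
        held.set()
        finish.wait()

    t = threading.Thread(target=holder)
    t.start()
    held.wait()

    # The main thread cannot get the lock, so acquire() times out.
    acquired = lock.acquire(timeout=0.1)
    # locked() says nothing about who holds the lock.
    locked_before = lock.locked()
    # A plain Lock has no owner check: this releases the helper's lock.
    lock.release()
    locked_after = lock.locked()

    finish.set()
    t.join()
    return acquired, locked_before, locked_after


result = demo()
print(result)  # (False, True, False)
```

So a True result from locked() in an except branch is no evidence that the interrupted acquire call succeeded.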

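This means that if you interrupt the lock.acquire call in this method with Ctrl-C, the code will release the lock (which someone else is holding) and call self._stop to perform end-of-thread cleanup, including marking the thread as no longer alive. That is why your is_alive calls return False.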
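One possible way to sidestep the problem on an affected interpreter, offered as a hedged sketch rather than a verified fix: make sure Ctrl-C never raises an exception while the main thread is inside join. The sketch below installs a SIGINT handler that only records the request, and the main thread sleeps in a polling loop until the request has been recorded. The names state, request_exit, install_sigint_handler, and wait_until_requested are hypothetical and not part of the question's code.

```python
import signal
import time

# Hypothetical shared flag, written by the signal handler.
state = {"stop": False}


def request_exit(signum, frame):
    # Record the request instead of raising KeyboardInterrupt, so no
    # exception can surface inside Thread.join or Thread.is_alive.
    state["stop"] = True


def install_sigint_handler():
    # signal.signal() may only be called from the main thread.
    signal.signal(signal.SIGINT, request_exit)


def wait_until_requested(poll_interval=0.2):
    # Sleep in short steps until the handler has recorded Ctrl-C.
    while not state["stop"]:
        time.sleep(poll_interval)
```

In the main thread of the question's program, this would be used roughly as install_sigint_handler(), then wait_until_requested(), then exit_signal.set(), and only then the join calls on the workers, which at that point are no longer at risk of being interrupted by a KeyboardInterrupt.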