Why does running a background task over ssh fail if a pseudo-tty is allocated?

Question

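I've recently run into some slightly odd behaviour when running commands over ssh. I would be interested to hear any explanation of the behaviour below.

Running ssh localhost 'touch foobar &' creates a file called foobar as expected: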

[bob@server ~]$ ssh localhost 'touch foobar &'
[bob@server ~]$ ls foobar
foobar

However, running the same command with the -t option to force pseudo-tty allocation fails to create foobar:

[bob@server ~]$ ssh -t localhost 'touch foobar &'
Connection to localhost closed.
[bob@server ~]$ echo $?
0
[bob@server ~]$ ls foobar
ls: cannot access foobar: No such file or directory

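My current theory is that because the touch process is being backgrounded, the pseudo-tty is allocated and deallocated before the process has a chance to run. Certainly, adding a one-second sleep allows touch to run as expected: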

[bob@pidora ~]$ ssh -t localhost 'touch foobar & sleep 1'
Connection to localhost closed.
[bob@pidora ~]$ ls foobar
foobar

If anyone has a definitive explanation I would be very interested to hear it. Thanks.

Answer 1

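Oh, that's a good one.

This has to do with how process groups work, how bash behaves when invoked as a non-interactive shell with -c, and the effect of & in input commands.

The answer assumes you're familiar with how job control works in UNIX; if you're not, here's a high-level view: every process belongs to a process group (processes in the same group are often put there as part of a command pipeline, e.g. cat file | sort | grep 'word' would place the processes running cat(1), sort(1) and grep(1) in the same process group). bash is a process like any other, and it also belongs to a process group. Process groups are part of a session (a session is composed of one or more process groups). In a session, there is at most one foreground process group, and possibly many background process groups. The foreground process group has control of the terminal (if there is a controlling terminal attached to the session); the session leader (bash) moves processes from background to foreground and from foreground to background with tcsetpgrp(3). A signal sent to a process group is delivered to every process in that group.

If the concepts of process groups and job control are completely new to you, I think you'll need to read up on them to fully understand this answer. A great resource for this is Chapter 9 of Advanced Programming in the UNIX Environment (3rd edition).

That being said, let's see what is happening here. We have to put together every piece of the puzzle.

In both cases, the ssh remote side invokes bash(1) with -c. The -c flag causes bash(1) to run as a non-interactive shell. From the manpage: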

An interactive shell is one started without non-option arguments and without the -c option whose standard input and error are both connected to terminals (as determined by isatty(3)), or one started with the -i option. PS1 is set and $- includes i if bash is interactive, allowing a shell script or a startup file to test this state.

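Also, it is important to know that job control is disabled when bash is started in non-interactive mode. This means that bash will not create a separate process group to run the command: since job control is disabled, there will be no need to move this command between foreground and background, so it might as well just remain in the same process group as bash. This happens whether or not you force PTY allocation on ssh with -t.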
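
The effect of job control on process groups can be checked locally, without ssh. The following is a minimal sketch, assuming bash is installed and ps accepts -o pgid= -p PID (as procps does); the helper name pgid_of and the variable names are made up for illustration.

```shell
# Compare the process group of a background job with that of its shell,
# first with job control off (the default for bash -c), then with set -m.

# Script run by each child bash: start a background job and report whether
# it shares the shell's process group. ($! = PID of the job, $$ = PID of bash)
check='
pgid_of() { ps -o pgid= -p "$1" | tr -d " "; }
sleep 1 >/dev/null &
if [ "$(pgid_of $!)" = "$(pgid_of $$)" ]; then echo same; else echo different; fi
'

without_m=$(bash -c "$check")        # job control disabled
with_m=$(bash -c "set -m; $check")   # job control forced on

echo "without set -m: $without_m"
echo "with set -m:    $with_m"
```

On a typical Linux system this should report "same" without set -m and "different" with it, which is the distinction the rest of this answer relies on.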

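However, the use of & has the side effect of causing the shell not to wait for command termination (even if job control is disabled). From the manpage: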

If a command is terminated by the control operator &, the shell executes the command in the background in a subshell. The shell does not wait for the command to finish, and the return status is 0. Commands separated by a ; are executed sequentially; the shell waits for each command to terminate in turn. The return status is the exit status of the last command executed.

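So, in both cases, bash will not wait for command execution, and touch(1) will be executed in the same process group as bash(1).

Now, consider what happens when a session leader exits. Quoting from the setpgid(2) manpage: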

If a session has a controlling terminal, and the CLOCAL flag for that terminal is not set, and a terminal hangup occurs, then the session leader is sent a SIGHUP. If the session leader exits, then a SIGHUP signal will also be sent to each process in the foreground process group of the controlling terminal.

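(Emphasis mine)

When you don't use -t

When you don't use -t, there is no PTY allocation on the remote side, so bash is not a session leader, and in fact no new session is created. Because sshd is running as a daemon, the bash process that is forked + exec()'d will not have a controlling terminal. As such, even though the shell terminates very quickly (probably before touch(1)), there is no SIGHUP sent to the process group, because bash wasn't a session leader (and there is no controlling terminal). So everything works.

When you use -t

-t forces PTY allocation, which means that the ssh remote side will call setsid(2), allocate a pseudo-terminal + fork a new process with forkpty(3), connect the PTY master device input and output to the socket endpoints that lead to your machine, and finally execute bash(1). forkpty(3) opens the PTY slave side in the forked process that will become bash; since there is no controlling terminal for the current session, and a terminal device is being opened, the PTY device becomes the controlling terminal for the session and bash becomes the session leader.

Then the same thing happens again: touch(1) is executed in the same process group, etc., yadda yadda. The point is, this time there is a session leader and a controlling terminal. So, since bash doesn't bother waiting because of the &, when it exits, SIGHUP is delivered to the process group and touch(1) dies prematurely.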

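About nohup

nohup(1) doesn't work here because there is still a race condition. If bash(1) terminates before nohup(1) has the chance to set up the necessary signal handling and file redirection, it will have no effect (which is probably what happens).

Possible fix

Forcefully re-enabling job control fixes it. In bash, you do that with set -m. This works: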

ssh -t localhost 'set -m ; touch foobar &'

Or force bash to wait for touch(1) to complete:

ssh -t localhost 'touch foobar & wait `pgrep touch`'
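
A note on the second form: pgrep touch can match unrelated touch processes on the host, and wait only accepts PIDs of the shell's own children. One possible variant is to wait on $!, which the shell sets to the PID of the most recent background command. The sketch below runs the same pattern locally rather than over ssh, so it only illustrates the wait itself, not the SIGHUP failure; the temporary directory and variable names are placeholders.

```shell
# Local stand-in for: ssh -t localhost 'touch foobar & wait $!'
tmpdir=$(mktemp -d)          # scratch directory standing in for the remote home
target="$tmpdir/foobar"

# The child shell backgrounds touch, then blocks until that exact child exits.
bash -c 'touch "$1" & wait $!' _ "$target"
status=$?                    # exit status of wait, which is that of touch

ls "$target" && echo "wait returned $status"
```

Because the shell does not exit until wait returns, touch has finished before the session leader goes away.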

Answer 2

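The key is to detach the stdin/stdout/stderr streams of the child process from the original bash/ssh session; pseudo-tty allocation (ssh -t) is then no longer needed to allow the child to survive after the ssh connection terminates. See here for a full answer...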
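
Why detaching the streams matters can be seen without ssh. Without a PTY, sshd typically keeps the connection open until every process holding the command's stdout/stderr has closed them. Command substitution behaves analogously, since it reads a pipe until all writers are gone. The following is a rough local sketch under that analogy; the function names are invented for illustration.

```shell
# Time a command substitution whose background child either keeps or drops
# the inherited stdout pipe.

inherits_stdout() {   # sleep keeps the pipe open, so $(...) blocks for ~3 s
    out=$(sleep 3 & echo started)
}

detached_stdout() {   # sleep has no handle on the pipe, so $(...) returns at once
    out=$(sleep 3 >/dev/null 2>&1 </dev/null & echo started)
}

elapsed() {           # run "$1" and store the whole seconds it took in $secs
    start=$(date +%s)
    "$1"
    end=$(date +%s)
    secs=$((end - start))
}

elapsed inherits_stdout; blocked=$secs
elapsed detached_stdout; detached=$secs
echo "inherited stdout: ${blocked}s, detached: ${detached}s"
```

Applied to ssh, the same redirections would look something like ssh localhost 'sleep 66 >/dev/null 2>&1 </dev/null &', which lets the ssh command return while the remote process keeps running.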

Answer 3

@Filipe Gonçalves' answer is great, but it has a problem. I don't have enough reputation to comment there, so I correct/enrich the content here:

When you don't use -t,

@Filipe says:

When you don't use -t, there is no PTY allocation on the remote side, so bash is not a session leader, and in fact no new session is created. ...

Actually, bash is a session leader, and a new session is created.

Let's test:

# run sleep background process first, then call ps directly:
[root@90fb1c3f30ce ~]# ssh localhost  'sleep 66 & ps -o pid,ppid,pgid,sess,tpgid,tty,args'
    PID    PPID    PGID    SESS   TPGID TT       COMMAND
 184074      67  184074  184074      -1 ?        sshd: root@notty
 184076  184074  184076  184076      -1 ?        bash -c sleep 66 & ps -o pid,ppid,pgid,sess,tpgid,tty,args
 184081  184076  184076  184076      -1 ?        sleep 66
 184082  184076  184076  184076      -1 ?        ps -o pid,ppid,pgid,sess,tpgid,tty,args

Notice           ^^^^^   ^^^^^

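We can see that the bash/sleep/ps processes share the same PGID/SESS, equal to the PID of the bash process (184076), while the sshd parent process has a different PGID/SESS. Here, the bash process is the leader of a new session, and the bash/sleep/ps processes belong to a different process group than sshd.

Furthermore, we can see that the ssh command doesn't return immediately; it still waits about 66 seconds. You can find the reason here: Getting ssh to execute a command in the background on target machine

While the ssh command is waiting, we can open another session and run: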

[root@90fb1c3f30ce ~]# ps -eo pid,ppid,pgid,sess,tpgid,tty,args
    PID    PPID    PGID    SESS   TPGID TT       COMMAND
    # unrelated lines removed #
 184074      67  184074  184074      -1 ?        sshd: root@notty
 184081       1  184076  184076      -1 ?        sleep 66
Notice           ^^^^^   ^^^^^

[root@90fb1c3f30ce ~]# ps -e | grep 184076
[root@90fb1c3f30ce ~]#

We can see that the bash process (pid 184076) is gone, but the PGID/SESS of the background sleep process is unchanged. That is fine; APUE section 9.4:

Each process group can have a process group leader. The leader is identified by its process group ID being equal to its process ID.

It is possible for a process group leader to create a process group, create processes in the group, and then terminate. The process group still exists, as long as at least one process is in the group, regardless of whether the group leader terminates.

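So, why doesn't this sleep process die?

When you don't use -t, there is no PTY allocation on the remote side, so the remote process group is not a foreground process group (without a terminal, foreground and background have no meaning). As such, even though the shell terminates very quickly, no SIGHUP is sent to its process group, because that process group is not a foreground process group. (The SIGHUP signal is sent to each process in the foreground process group of the controlling terminal.)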
