GNU Parallel not returning output values across remote hosts

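GNU Parallel (installed with yum from EPEL, not from the GNU Parallel site) is not returning output from processes distributed to remote hosts, and I'm not sure why.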

The concurrent jobs I'm trying to run resemble this simple example...

[myuser]$  parallel -q -j 5 \
    --sshloginfile ./parallel-nodes.txt \
    echo "Number {}: Running on `hostname`" ::: 1 2 3 4 5 6 7 8 9 10
Number 9: Running on HW04.co.local
Number 3: Running on HW04.co.local
Number 5: Running on HW04.co.local
Number 8: Running on HW04.co.local
Number 2: Running on HW04.co.local
Number 6: Running on HW04.co.local
^C^C^C^C%  

This hangs until I press Ctrl+C to exit (i.e. it only ever runs on the calling host). When no sshloginfile is provided, there is no problem...

[myuser]$ parallel -q -j 5 echo "Number {}: Running on `hostname`" ::: 1 2 3 
Number 3: Running on HW04.co.local
Number 1: Running on HW04.co.local
Number 2: Running on HW04.co.local

While running this and watching it hang, I checked each node for processes that might be related to the parallel command...

[root@HW01 ~]# clush -ab "ps -aux | grep echo"
---------------
HW01
---------------
root     136318  0.0  0.0 294648 16468 pts/2    Sl+  15:39   0:00 /usr/bin/python2 /usr/bin/clush -ab ps -aux | grep echo
root     136322  0.0  0.0 185096  4824 pts/2    S+   15:39   0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW01 ps -aux | grep echo
root     136323  0.0  0.0 185096  4824 pts/2    S+   15:39   0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW02 ps -aux | grep echo
root     136324  0.0  0.0 185096  4820 pts/2    S+   15:39   0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW03 ps -aux | grep echo
root     136325  0.0  0.0 185096  4824 pts/2    S+   15:39   0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW04 ps -aux | grep echo
root     136334  0.0  0.0 113176  1584 ?        Ss   15:39   0:00 bash -c ps -aux | grep echo
root     136351  0.0  0.0 112712   968 ?        S    15:39   0:00 grep echo
---------------
HW02
---------------
root      85835  0.0  0.0 113176  1580 ?        Ss   15:39   0:00 bash -c ps -aux | grep echo
root      85846  0.0  0.0 112708   968 ?        S    15:39   0:00 grep echo
---------------
HW03
---------------
root     120282  0.0  0.0 113176  1576 ?        Ss   15:39   0:00 bash -c ps -aux | grep echo
root     120293  0.0  0.0 112708   968 ?        S    15:39   0:00 grep echo
---------------
HW04
---------------
hph_etl  113600  1.5  0.0 157516 11944 pts/2    S+   15:39   0:00 perl /bin/parallel -q -j 5 --sshloginfile /home/me/projects/myproject/parallel-nodes.txt echo Number {}: Running on HW04.co.local ::: 1 2 3 4 5 6 7 8 9 10
root     114154  0.0  0.0 113176  1584 ?        Ss   15:39   0:00 bash -c ps -aux | grep echo
root     114168  0.0  0.0 112712   960 ?        S    15:39   0:00 grep echo

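So it looks as if the command is never delivered to the other nodes and just stays on the calling node (HW04 here). However, checking whether firewalld is running on any of the hosts...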

[root@HW01 ~]# clush -ab systemctl status firewalld
---------------
HW01
---------------
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)
---------------
HW02
---------------
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)

Jul 16 15:17:27 HW02.ucera.local systemd[1]: Starting firewalld - dynamic firewall daemon...
Jul 16 15:17:28 HW02.ucera.local systemd[1]: Started firewalld - dynamic firewall daemon.
Jul 17 16:05:32 HW02.ucera.local systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jul 17 16:05:33 HW02.ucera.local systemd[1]: Stopped firewalld - dynamic firewall daemon.
---------------
HW03
---------------
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)

Jul 16 15:11:15 HW03.ucera.local systemd[1]: Starting firewalld - dynamic firewall daemon...
Jul 16 15:11:16 HW03.ucera.local systemd[1]: Started firewalld - dynamic firewall daemon.
Jul 17 16:05:46 HW03.ucera.local systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jul 17 16:05:47 HW03.ucera.local systemd[1]: Stopped firewalld - dynamic firewall daemon.
---------------
HW04
---------------
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2019-07-25 15:00:33 HST; 4 days ago
     Docs: man:firewalld(1)
  Process: 3303 ExecStart=/usr/sbin/firewalld --nofork --nopid $FIREWALLD_ARGS (code=exited, status=0/SUCCESS)
 Main PID: 3303 (code=exited, status=0/SUCCESS)

Jul 25 15:00:32 HW04.ucera.local systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jul 25 15:00:33 HW04.ucera.local systemd[1]: Stopped firewalld - dynamic firewall daemon.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
clush: HW[01-04] (4): exited with exit code 3

shows that it is inactive on all hosts.

At this point, I'm not sure what is going wrong. Can anyone think of any debugging suggestions or fixes?

** Also, when the --bibtex option is included in the command, none of the commands listed above work. Does anyone know why that is?

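In the example you link to, notice how the backticks are escaped with backslashes? You need to do the same; otherwise hostname is executed by the shell on HW04 before parallel ever contacts the other machines.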
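The difference can be seen locally, without parallel or ssh. The following is a minimal sketch; the variable names are only illustrative, and the second sh merely stands in for the shell that would evaluate the command on a remote host.

```shell
# Illustrative variable names; nothing here is specific to GNU Parallel.

# Unescaped backticks: the local shell runs hostname immediately,
# so the string already contains the local host name.
expanded_locally="Running on `hostname`"

# Escaped backticks: the string keeps the literal text `hostname`,
# leaving it for whichever shell evaluates the command later.
deferred="Running on \`hostname\`"

printf '%s\n' "$expanded_locally"
printf '%s\n' "$deferred"

# A second shell (standing in for the remote one) expands the deferred form.
sh -c "echo $deferred"
```

The first printf shows the calling machine's name already substituted, which matches the HW04.co.local visible in the ps output of the question.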

First, I would check whether you are talking to the other machines at all:

parallel -j 5 \
    --sshloginfile ./parallel-nodes.txt \
    echo "Number {}: Running on \`hostname\`" ::: 1 2 3 4 5 6 7 8 9 10

Then I would trace your passwordless ssh setup one machine at a time to make sure it actually works; from HW04 try:

parallel -S HW01 'echo -n {} ""; hostname' ::: 1
parallel -S HW02 'echo -n {} ""; hostname' ::: 1
parallel -S HW03 'echo -n {} ""; hostname' ::: 1
parallel -S HW04 'echo -n {} ""; hostname' ::: 1

(repeat for every machine in the parallel-nodes.txt file)
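If the node list is long, the same per-machine check can be scripted with plain ssh. This is only a sketch: check_nodes and the SSH override variable are hypothetical names, not part of GNU Parallel, and it assumes the file holds one plain hostname per line (sshloginfile extras such as a job-count prefix or ":" for the local machine are not handled).

```shell
# Hypothetical helper (not part of GNU Parallel): check_nodes FILE
# Attempts a non-interactive ssh login to every host listed in FILE.
# Assumption: one plain hostname per line; blank lines and # comments are skipped.
check_nodes() {
    while IFS= read -r node; do
        case $node in
            ''|'#'*) continue ;;
        esac
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        # -n stops ssh from reading stdin and swallowing the rest of the list.
        # SSH is a hypothetical override for the ssh binary, useful for testing.
        if ${SSH:-ssh} -n -o BatchMode=yes -o ConnectTimeout=5 "$node" hostname >/dev/null 2>&1
        then
            echo "$node: ok"
        else
            echo "$node: FAILED"
        fi
    done < "$1"
}

# Usage, assuming the node file from the question:
#   check_nodes ./parallel-nodes.txt
```

Any host reported as FAILED would hang or prompt under parallel as well, so it is the one to debug first.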

If ssh does not work for one of the machines, you can try debugging it with:

PARALLEL_SSH='ssh -v' parallel -S HW03 'echo -n {} ""; hostname' ::: 1