GNU Parallel not returning output values across remote hosts
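GNU Parallel (installed from EPEL via yum, not downloaded from the GNU Parallel site) is not returning output from processes distributed to remote hosts, and I'm not sure why.
The concurrent job I'm trying to run is similar to this simple example: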
[myuser]$ parallel -q -j 5 \
--sshloginfile ./parallel-nodes.txt \
echo "Number {}: Running on `hostname`" ::: 1 2 3 4 5 6 7 8 9 10
Number 9: Running on HW04.co.local
Number 3: Running on HW04.co.local
Number 5: Running on HW04.co.local
Number 8: Running on HW04.co.local
Number 2: Running on HW04.co.local
Number 6: Running on HW04.co.local
^C^C^C^C%
This hangs until I press Ctrl+C to exit (i.e. it only ever runs on the calling host). When --sshloginfile is not supplied, there is no problem:
[myuser]$ parallel -q -j 5 echo "Number {}: Running on `hostname`" ::: 1 2 3
Number 3: Running on HW04.co.local
Number 1: Running on HW04.co.local
Number 2: Running on HW04.co.local
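- I can confirm that all nodes in the --sshloginfile have passwordless ssh enabled and can ssh to each other without a password.
- I can also confirm that GNU Parallel is installed on all nodes involved.
- The user invoking parallel exists on all nodes involved.
- I have also checked that every host FQDN in the sshloginfile appears under the same name in the .ssh/known_hosts file on the relevant hosts.
When I ran this and saw it hang, I checked each node for processes that might be related to the parallel command: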
[root@HW01 ~]# clush -ab "ps -aux | grep echo"
---------------
HW01
---------------
root 136318 0.0 0.0 294648 16468 pts/2 Sl+ 15:39 0:00 /usr/bin/python2 /usr/bin/clush -ab ps -aux | grep echo
root 136322 0.0 0.0 185096 4824 pts/2 S+ 15:39 0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW01 ps -aux | grep echo
root 136323 0.0 0.0 185096 4824 pts/2 S+ 15:39 0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW02 ps -aux | grep echo
root 136324 0.0 0.0 185096 4820 pts/2 S+ 15:39 0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW03 ps -aux | grep echo
root 136325 0.0 0.0 185096 4824 pts/2 S+ 15:39 0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW04 ps -aux | grep echo
root 136334 0.0 0.0 113176 1584 ? Ss 15:39 0:00 bash -c ps -aux | grep echo
root 136351 0.0 0.0 112712 968 ? S 15:39 0:00 grep echo
---------------
HW02
---------------
root 85835 0.0 0.0 113176 1580 ? Ss 15:39 0:00 bash -c ps -aux | grep echo
root 85846 0.0 0.0 112708 968 ? S 15:39 0:00 grep echo
---------------
HW03
---------------
root 120282 0.0 0.0 113176 1576 ? Ss 15:39 0:00 bash -c ps -aux | grep echo
root 120293 0.0 0.0 112708 968 ? S 15:39 0:00 grep echo
---------------
HW04
---------------
hph_etl 113600 1.5 0.0 157516 11944 pts/2 S+ 15:39 0:00 perl /bin/parallel -q -j 5 --sshloginfile /home/me/projects/myproject/parallel-nodes.txt echo Number {}: Running on HW04.co.local ::: 1 2 3 4 5 6 7 8 9 10
root 114154 0.0 0.0 113176 1584 ? Ss 15:39 0:00 bash -c ps -aux | grep echo
root 114168 0.0 0.0 112712 960 ? S 15:39 0:00 grep echo
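So it looks like the commands are not being sent to the other nodes at all and just stay on the calling node (HW04 here). However, checking whether firewalld is running on any of the hosts: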
[root@HW01 ~]# clush -ab systemctl status firewalld
---------------
HW01
---------------
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
---------------
HW02
---------------
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
Jul 16 15:17:27 HW02.ucera.local systemd[1]: Starting firewalld - dynamic firewall daemon...
Jul 16 15:17:28 HW02.ucera.local systemd[1]: Started firewalld - dynamic firewall daemon.
Jul 17 16:05:32 HW02.ucera.local systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jul 17 16:05:33 HW02.ucera.local systemd[1]: Stopped firewalld - dynamic firewall daemon.
---------------
HW03
---------------
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
Jul 16 15:11:15 HW03.ucera.local systemd[1]: Starting firewalld - dynamic firewall daemon...
Jul 16 15:11:16 HW03.ucera.local systemd[1]: Started firewalld - dynamic firewall daemon.
Jul 17 16:05:46 HW03.ucera.local systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jul 17 16:05:47 HW03.ucera.local systemd[1]: Stopped firewalld - dynamic firewall daemon.
---------------
HW04
---------------
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Thu 2019-07-25 15:00:33 HST; 4 days ago
Docs: man:firewalld(1)
Process: 3303 ExecStart=/usr/sbin/firewalld --nofork --nopid $FIREWALLD_ARGS (code=exited, status=0/SUCCESS)
Main PID: 3303 (code=exited, status=0/SUCCESS)
Jul 25 15:00:32 HW04.ucera.local systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jul 25 15:00:33 HW04.ucera.local systemd[1]: Stopped firewalld - dynamic firewall daemon.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
clush: HW[01-04] (4): exited with exit code 3
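shows that it is inactive on all hosts.
At this point I'm not sure what is going wrong. Can anyone suggest debugging steps or a fix?
** Also, when I include the --bibtex option in the command, none of the commands listed above work. Does anyone know why that would be?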
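In the example you link to, see how the backticks are backslash-escaped? You need to do that; otherwise hostname is executed by the shell on HW04 before parallel ever talks to the other machines.
First, I would check whether you are actually talking to the other machines: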
parallel -j 5 \
--sshloginfile ./parallel-nodes.txt \
echo "Number {}: Running on \`hostname\`" ::: 1 2 3 4 5 6 7 8 9 10
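To see why the escaping matters, here is a minimal sketch that needs neither parallel nor ssh. It only compares the two strings the calling shell would hand over as the command argument. The variable names local_expanded and kept_literal are made up for illustration.

```shell
# Unescaped backticks inside double quotes are expanded by the calling shell,
# so the local hostname is baked into the string before parallel ever starts.
local_expanded="Number {}: Running on `hostname`"

# Escaped backticks stay literal, so the command substitution is still
# there for a later shell (the one on the remote host) to run.
kept_literal="Number {}: Running on \`hostname\`"

printf 'unescaped: %s\n' "$local_expanded"
printf 'escaped:   %s\n' "$kept_literal"
```

The first line printed already contains the name of the machine you are sitting at, which matches the ps output in the question, where the parallel process on HW04 shows "Running on HW04.co.local" in its own argument list.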
Then I would test the passwordless ssh setup one machine at a time to make sure it really works; from HW04, try:
parallel -S HW01 'echo -n {} ""; hostname' ::: 1
parallel -S HW02 'echo -n {} ""; hostname' ::: 1
parallel -S HW03 'echo -n {} ""; hostname' ::: 1
parallel -S HW04 'echo -n {} ""; hostname' ::: 1
(repeat for every machine in the parallel-nodes.txt file)
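If the node list is long, the per-host commands can be generated rather than typed. The following is only a sketch: print_host_checks is a hypothetical helper, and it assumes every non-empty, non-comment line of the file is a plain hostname (no user@ prefix or job-count prefix). It prints the commands instead of running them, so you can review them first.

```shell
#!/bin/sh
# Print one single-host test command per entry in an sshloginfile.
# Assumption: each non-empty, non-comment line is a plain hostname.
print_host_checks() {
    nodes_file=$1
    while IFS= read -r host || [ -n "$host" ]; do
        case $host in
            ''|'#'*) continue ;;   # skip blank lines and comment lines
        esac
        printf "parallel -S %s 'echo -n {} \"\"; hostname' ::: 1\n" "$host"
    done < "$nodes_file"
}

# Demo with a throwaway file; in practice pass ./parallel-nodes.txt instead.
tmp_nodes=$(mktemp)
printf 'HW01\nHW02\n\n# spare node\nHW03\n' > "$tmp_nodes"
checks=$(print_host_checks "$tmp_nodes")
rm -f "$tmp_nodes"
printf '%s\n' "$checks"
```

Once the printed commands look right, they can be pasted into the shell on HW04 one at a time, so a hang can be pinned to a single host.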
If one of the machines fails over ssh, you can try debugging it with:
PARALLEL_SSH='ssh -v' parallel -S HW03 'echo -n {} ""; hostname' ::: 1