Two MPI compute nodes cannot complete a TCP connection because of a firewall
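I am trying to run a simple MPI example on two compute nodes, node1 and node2, which are virtual machines I just created on Oracle Cloud. (This is my first time using Oracle Cloud.) The OS is Ubuntu 20.04. What I have done so far:
- node1 and node2 have a working MPI environment (OpenMPI-4.1.0) under the same path. $PATH and $LD_LIBRARY_PATH are set as well. I can run the MPI example successfully on a single node.
- Passwordless login is set up between node1 and node2. I can use ssh node1 and ssh node2 to connect from one node to the other.
- A hostfile (hosts2) exists under the same path ($HOSTFILE_PATH/hosts2) on both nodes and contains:
node1 slots=1
node2 slots=1
- The executable (test) is under the same path ($EXE_PATH/test) on both nodes.
Then I ran $(which mpirun) -n 2 -hostfile $HOSTFILE_PATH/hosts2 $EXE_PATH/test, but it never returned, so I could only terminate it with Ctrl+C. A few minutes later I got this output: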
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: instance-1-632783
Remote host: instance-1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
Is the problem related to the firewall? I tried sudo ufw status and got Status: inactive. I also tried sudo iptables -L and got:
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED
ACCEPT icmp -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT udp -- anywhere anywhere udp spt:ntp
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:ssh
REJECT all -- anywhere anywhere reject-with icmp-host-prohibited
Chain FORWARD (policy ACCEPT)
target prot opt source destination
REJECT all -- anywhere anywhere reject-with icmp-host-prohibited
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
InstanceServices all -- anywhere link-local/16
Chain InstanceServices (1 references)
target prot opt source destination
ACCEPT tcp -- anywhere 169.254.0.2 owner UID match root tcp dpt:iscsi-target /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT tcp -- anywhere 169.254.2.0/24 owner UID match root tcp dpt:iscsi-target /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT tcp -- anywhere 169.254.4.0/24 owner UID match root tcp dpt:iscsi-target /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT tcp -- anywhere 169.254.5.0/24 owner UID match root tcp dpt:iscsi-target /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT tcp -- anywhere 169.254.0.2 tcp dpt:http /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT udp -- anywhere 169.254.169.254 udp dpt:domain /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT tcp -- anywhere 169.254.169.254 tcp dpt:domain /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT tcp -- anywhere 169.254.0.3 owner UID match root tcp dpt:http /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT tcp -- anywhere 169.254.0.4 tcp dpt:http /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT tcp -- anywhere 169.254.169.254 tcp dpt:http /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT udp -- anywhere 169.254.169.254 udp dpt:bootps /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT udp -- anywhere 169.254.169.254 udp dpt:tftp /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT udp -- anywhere 169.254.169.254 udp dpt:ntp /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
REJECT tcp -- anywhere link-local/16 tcp /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */ reject-with tcp-reset
REJECT udp -- anywhere link-local/16 udp /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */ reject-with icmp-port-unreachable
Then I tried sudo iptables -F, after which sudo iptables -L showed:
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
Chain InstanceServices (0 references)
target prot opt source destination
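However, sudo iptables -F seems to remove the rules only temporarily. After I reboot the system, sudo iptables -L shows the previous output again. So how do I solve the firewall problem? Should I remove these rules permanently, and if so, how?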
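Sometimes ufw commands do not change the iptables rules on OCI. I suggest using iptables commands directly instead. For more commands, see linux-iptables-firewall-rules-examples-commands.
Use the following command to list all IPv4 rules: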
sudo iptables -S
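The INPUT chain shown in the question accepts new TCP connections only on the ssh port and then ends with a catch-all REJECT rule, and iptables evaluates rules in order. One possible approach, sketched below, is to insert an ACCEPT rule for the nodes' private subnet at the top of INPUT and then save it so it survives a reboot. The subnet 10.0.0.0/24 is an assumed value, not something stated in the question, and the save step assumes the netfilter-persistent tool is installed on the image. The script only prints the command unless APPLY=1 is set.

```shell
#!/bin/sh
# Sketch: allow traffic from the cluster's private subnet ahead of the
# catch-all REJECT rule in the INPUT chain.
# SUBNET_CIDR is an assumption; replace it with the CIDR of your own subnet.
SUBNET_CIDR="${SUBNET_CIDR:-10.0.0.0/24}"

# Print the iptables arguments that insert an ACCEPT rule at position 1 of
# INPUT, so it is evaluated before the existing REJECT rule.
accept_rule() {
    printf '%s\n' "-I INPUT 1 -s $1 -j ACCEPT"
}

if [ "${APPLY:-0}" = "1" ]; then
    # Word splitting of the function output is intentional here.
    sudo iptables $(accept_rule "$SUBNET_CIDR")
    # Persist the running rules (assumes netfilter-persistent is installed).
    sudo netfilter-persistent save
else
    echo "would run: sudo iptables $(accept_rule "$SUBNET_CIDR")"
fi
```

Run it on both nodes, then check the result with sudo iptables -S. Accepting the whole subnet is a trade-off for simplicity; a narrower rule limited to specific ports would be stricter.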
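Even when the VMs are in the same subnet, you still have to allow traffic between them.
So open the required ports in the security list of the subnet you are using (https://docs.oracle.com/en-us/iaas/Content/Network/Concepts/securitylists.htm#Security_Lists).
If you don't know which ports are required, you can open all ports (this is not good practice for a production environment).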
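As an alternative to opening every port, Open MPI's TCP components can be asked to stay within fixed port ranges, so that only those ranges need to be opened in the security list and in iptables. The sketch below builds such an mpirun command line using the MCA parameters btl_tcp_port_min_v4, btl_tcp_port_range_v4 and oob_tcp_dynamic_ipv4_ports. The port numbers are arbitrary assumptions, and it is worth confirming with ompi_info that your Open MPI build exposes these parameters.

```shell
#!/bin/sh
# Sketch: confine Open MPI's TCP traffic to known port ranges.
# All port numbers below are assumed values chosen for illustration.
BTL_PORT_MIN=50000       # first port for MPI point-to-point traffic
BTL_PORT_RANGE=100       # number of ports starting at BTL_PORT_MIN
OOB_PORTS="50100-50200"  # range for the runtime (daemon) connections

# Print the MCA options that pin both kinds of TCP traffic to those ranges.
mpi_port_args() {
    printf '%s' "--mca btl_tcp_port_min_v4 $BTL_PORT_MIN"
    printf ' %s' "--mca btl_tcp_port_range_v4 $BTL_PORT_RANGE"
    printf ' %s\n' "--mca oob_tcp_dynamic_ipv4_ports $OOB_PORTS"
}

# Show the full command; the path variables are left unexpanded on purpose.
echo "mpirun $(mpi_port_args) -n 2 -hostfile \$HOSTFILE_PATH/hosts2 \$EXE_PATH/test"
```

With these values, TCP ports 50000 through 50200 would need to be allowed between the two nodes, both in the subnet's security list and in each node's iptables rules.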