ipython with MPI clustering using machinefile
Based on testing the helloworld.py script from the mpi4py demo directory, I have successfully configured MPI with mpi4py support across three nodes:
gms@host:~/development/mpi$ mpiexec -f machinefile -n 10 python ~/development/mpi4py/demo/helloworld.py
Hello, World! I am process 3 of 10 on host.
Hello, World! I am process 1 of 10 on worker1.
Hello, World! I am process 6 of 10 on host.
Hello, World! I am process 2 of 10 on worker2.
Hello, World! I am process 4 of 10 on worker1.
Hello, World! I am process 9 of 10 on host.
Hello, World! I am process 5 of 10 on worker2.
Hello, World! I am process 7 of 10 on worker1.
Hello, World! I am process 8 of 10 on worker2.
Hello, World! I am process 0 of 10 on host.
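For reference, the machinefile passed via -f is just a list of hosts for the Hydra launcher. I have not reproduced mine verbatim, but the round-robin rank placement in the output above is consistent with one entry per node, something like:

host
worker1
worker2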
I am now trying to get this working in ipython, and have added my machinefile to my $IPYTHON_DIR/profile_mpi/ipcluster_config.py file like so:
c.MPILauncher.mpi_args = ["-machinefile", "/home/gms/development/mpi/machinefile"]
I then launch an iPython notebook on my head node with: ipython notebook --profile=mpi --ip=* --port=9999 --no-browser &
And, voilà, I can access it fine from another device on my local network. However, when I run helloworld.py from the iPython notebook, I only get a response from the head node: Hello, World! I am process 0 of 10 on host.
I started MPI from iPython with 10 engines, but...
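As a sanity check on how many engines actually registered, I can connect a client from a notebook cell. A minimal sketch using the IPython.parallel API of this era (the profile name matches my setup; the rest is illustrative):

from IPython.parallel import Client

rc = Client(profile='mpi')   # reads ipcontroller-client.json for this profile
print(rc.ids)                # ids of the engines that actually registered

def where():
    import socket
    return socket.gethostname()

dv = rc[:]                   # DirectView over all registered engines
print(dv.apply_sync(where))  # with MPI engines this should list every host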
I further configured the following parameters, just in case:
In $IPYTHON_DIR/profile_mpi/ipcluster_config.py:
c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'
In $IPYTHON_DIR/profile_mpi/ipengine_config.py:
c.MPI.use = 'mpi4py'
In $IPYTHON_DIR/profile_mpi/ipcontroller_config.py:
c.HubFactory.ip = '*'
However, none of these helped either.
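(For context: these config files are the ones stubbed out by ipython profile create --parallel --profile=mpi, which generates ipcluster_config.py, ipcontroller_config.py, and ipengine_config.py under the profile directory.)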
What am I missing to get this working?
EDIT UPDATE 1
I now have NFS-mounted directories on my worker nodes, so I satisfy the requirement "Currently ipcluster requires that the IPYTHONDIR/profile_/security directory live on a shared filesystem that is seen by both the controller and engines." and am able to start my controller and engines using ipcluster, with the command ipcluster start --profile=mpi -n 6 &.
So, I issue this on my head node and get:
2016-03-04 20:31:26.280 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:26.283 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:26.284 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:27.282 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
2016-03-04 20:31:57.301 [IPClusterStart] Engines appear to have started successfully
Then, I proceed to issue the same command to start the engines on the other nodes, but get:
2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.095 [IPClusterStart] Starting ipcluster with [daemon=False]
2016-03-04 20:31:33.100 [IPClusterStart] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
2016-03-04 20:31:33.111 [IPClusterStart] Starting Controller with LocalControllerLauncher
2016-03-04 20:31:34.098 [IPClusterStart] Starting 6 Engines with MPIEngineSetLauncher
[1]+ Stopped ipcluster start --profile=mpi -n 6
without the confirmation that the Engines appear to have started successfully...
What is even more confusing: when I do ps au on the worker nodes, I get:
gms 3862 0.1 2.5 38684 23740 pts/0 T 20:31 0:01 /usr/bin/python /usr/bin/ipcluster start --profile=mpi -n 6
gms 3874 0.1 1.7 21428 16772 pts/0 T 20:31 0:01 /usr/bin/python -c from IPython.parallel.apps.ipcontrollerapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.co
gms 3875 0.0 0.2 4768 2288 pts/0 T 20:31 0:00 mpiexec -n 6 -machinefile /home/gms/development/mpi/machinefile /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new
gms 3876 0.0 0.4 5732 4132 pts/0 T 20:31 0:00 /usr/bin/ssh -x 192.168.1.1 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 -
gms 3877 0.0 0.1 4816 1204 pts/0 T 20:31 0:00 /usr/bin/hydra_pmi_proxy --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --proxy-id 1
gms 3878 0.0 0.4 5732 4028 pts/0 T 20:31 0:00 /usr/bin/ssh -x 192.168.1.201 "/usr/bin/hydra_pmi_proxy" --control-port 192.168.1.200:36753 --rmk user --launcher ssh --demux poll --pgid 0
gms 3879 0.0 0.6 8944 6008 pts/0 T 20:31 0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config
gms 3880 0.0 0.6 8944 6108 pts/0 T 20:31 0:00 /usr/bin/python -c from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance() --profile-dir /home/gms/.config
where the IP addresses in processes 3876 and 3878 belong to other hosts in the cluster. But...
When I run a similar test directly with ipython, all I get is a response from the local host (even though, minus ipython, this works directly with mpi and mpi4py, as noted at the top of this post):
gms@head:~/development/mpi$ ipython test.py
head[3834]: 0/1
gms@head:~/development/mpi$ mpiexec -f machinefile -n 10 ipython test.py
worker1[3961]: 4/10
worker1[3962]: 7/10
head[3946]: 6/10
head[3944]: 0/10
worker2[4054]: 5/10
worker2[4055]: 8/10
head[3947]: 9/10
worker1[3960]: 1/10
worker2[4053]: 2/10
head[3945]: 3/10
While I am confident my configuration is now correct, I still seem to be missing something obvious. One thing that jumps out is that when I start ipcluster on my worker nodes, I get this: 2016-03-04 20:31:33.092 [IPClusterStart] Removing pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcluster.pid
EDIT UPDATE 2
This is more to document what is happening and, hopefully, what ultimately gets this working:
I cleaned out my log files and reissued ipcluster start --profile=mpi -n 6 &
And now I have 6 log files for my engines, and 1 for my controller:
drwxr-xr-x 2 gms gms 12288 Mar 6 03:28 .
drwxr-xr-x 7 gms gms 4096 Mar 6 03:31 ..
-rw-r--r-- 1 gms gms 1313 Mar 6 03:28 ipcontroller-15664.log
-rw-r--r-- 1 gms gms 598 Mar 6 03:28 ipengine-15669.log
-rw-r--r-- 1 gms gms 598 Mar 6 03:28 ipengine-15670.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4405.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4406.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4628.log
-rw-r--r-- 1 gms gms 499 Mar 6 03:28 ipengine-4629.log
Looking in the log for ipcontroller, it appears that only one engine registered:
2016-03-06 03:28:12.469 [IPControllerApp] Hub listening on tcp://*:34540 for registration.
2016-03-06 03:28:12.480 [IPControllerApp] Hub using DB backend: 'NoDB'
2016-03-06 03:28:12.749 [IPControllerApp] hub::created hub
2016-03-06 03:28:12.751 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-client.json
2016-03-06 03:28:12.754 [IPControllerApp] writing connection info to /home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json
2016-03-06 03:28:12.758 [IPControllerApp] task::using Python leastload Task scheduler
2016-03-06 03:28:12.760 [IPControllerApp] Heartmonitor started
2016-03-06 03:28:12.808 [IPControllerApp] Creating pid file: /home/gms/.config/ipython/profile_mpi/pid/ipcontroller.pid
2016-03-06 03:28:14.792 [IPControllerApp] client::client 'a8441250-d3d7-4a0b-8210-dae327665450' requested 'registration_request'
2016-03-06 03:28:14.800 [IPControllerApp] client::client '12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295' requested 'registration_request'
2016-03-06 03:28:18.764 [IPControllerApp] registration::finished registering engine 1:'12fd0bcc-24e9-4ad0-8154-fcf1c7a0e295'
2016-03-06 03:28:18.768 [IPControllerApp] engine::Engine Connected: 1
2016-03-06 03:28:20.800 [IPControllerApp] registration::purging stalled registration: 0
Shouldn't all 6 engines be registered?
2 of the engines' logs look fine, showing that they registered:
2016-03-06 03:28:13.746 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:13.746 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()
2016-03-06 03:28:14.735 [IPEngineApp] Loading url_file u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.780 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:15.282 [IPEngineApp] Using existing profile dir:
u'/home/gms/.config/ipython/profile_mpi'
2016-03-06 03:28:15.286 [IPEngineApp] Completed registration with id 1
while the other one registered with id 0.
However, the other 4 engines reported a registration timeout:
2016-03-06 03:28:14.676 [IPEngineApp] Initializing MPI:
2016-03-06 03:28:14.689 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()
2016-03-06 03:28:14.733 [IPEngineApp] Loading url_file u'/home/gms/.config/ipython/profile_mpi/security/ipcontroller-engine.json'
2016-03-06 03:28:14.805 [IPEngineApp] Registering with controller at tcp://127.0.0.1:34540
2016-03-06 03:28:16.807 [IPEngineApp] Registration timed out after 2.0 seconds
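Two details stand out to me in those logs: the engines are registering against tcp://127.0.0.1:34540, which a remote engine cannot reach, and the registration window is only 2 seconds. A sketch of the config knobs I believe govern these (names taken from the IPython.parallel docs of this era; treat these as hypotheses, not a verified fix):

# in ipcontroller_config.py: advertise an address remote engines can reach,
# instead of the 127.0.0.1 that ends up in ipcontroller-engine.json
c.IPControllerApp.location = '192.168.1.200'  # my head node, per the ps output above

# in ipengine_config.py: give remote engines longer than 2 s to register
c.EngineFactory.timeout = 10.0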
Hmmm... I think I may try a reinstall of ipython tomorrow.
EDIT UPDATE 3
There were conflicting versions of ipython installed (apparently via both apt-get and pip). Uninstalling and reinstalling with pip install ipython[all]...
EDIT UPDATE 4
I hope someone finds this useful, and I hope someone can weigh in at some point to help clarify a few things.
Anywho, I set up a virtualenv to isolate my environment, and it looks like some degree of success. I fired up ipcluster start -n 4 --profile=mpi on each of my nodes, then ssh'ed back into my head node and ran a test script that first calls ipcluster. The output (not reproduced here) shows that it is doing some parallel computing.
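Roughly, that test does something like this (a sketch against the IPython.parallel API; variable names are illustrative):

from IPython.parallel import Client

rc = Client(profile='mpi')
dv = rc[:]                        # DirectView over every registered engine

# run a snippet on all engines; each engine records its own host and pid
dv.execute("import socket, os; msg = '%s[%d]' % (socket.gethostname(), os.getpid())", block=True)
print(dv['msg'])                  # pull the per-engine results back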
However, when I run my test script that queries all the nodes, I only get the head node back.
But, again, if I just run the straight-up mpiexec command, everything works fine.
What is even more confusing: if I look at the processes on the nodes, I see all sorts of behavior indicating they are working together.
And there is nothing unusual in my logs. Why am I not getting all the nodes back from my second test script (code included here):
# test_mpi.py
import os
import socket

from mpi4py import MPI

comm = MPI.COMM_WORLD  # the communicator spanning all MPI processes

# each rank reports which host and pid it is running on, plus its rank/size
print("{host}[{pid}]: {rank}/{size}".format(
    host=socket.gethostname(),
    pid=os.getpid(),
    rank=comm.rank,
    size=comm.size,
))
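For completeness, this is invoked the same way as the earlier runs (test.py in those runs appears to be this same script):

ipython test_mpi.py                               # single process: head[3834]: 0/1
mpiexec -f machinefile -n 10 ipython test_mpi.py  # one line per rank is expected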
Not sure why, but I recreated my ipcluster_config.py file, added c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"] to it again, and this time it worked - for some odd reason. I could swear I had this before, but alas...
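For the record, the combination that ended up working for me, pulled together in one place (paths are from my setup):

In $IPYTHON_DIR/profile_mpi/ipcluster_config.py:
c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'
c.MPILauncher.mpi_args = ["-machinefile", "/home/gms/development/mpi/machinefile"]

In $IPYTHON_DIR/profile_mpi/ipengine_config.py:
c.MPI.use = 'mpi4py'

In $IPYTHON_DIR/profile_mpi/ipcontroller_config.py:
c.HubFactory.ip = '*'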