MS MPI Permission errors
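I have two machines with MS MPI 7.1 installed, one called SERVER and one called COMPUTE.
The machines are set up on a LAN as a simple Windows workgroup (no AD), and both have an account with the same name and password.
Both are running the MSMPILaunchSvc service.
Both machines can execute MPI jobs locally, which I verified by testing with the hostname command: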
SERVER> mpiexec -hosts 1 SERVER 1 hostname
SERVER
or
COMPUTE> mpiexec -hosts 1 COMPUTE 1 hostname
COMPUTE
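in a terminal on the machine itself.
For convenience I have disabled the firewall on both machines.
My problem is that I cannot get MPI to run jobs from SERVER on a remote host:
1: SERVER with MSMPILaunchSvc -> COMPUTE with MSMPILaunchSvc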
SERVER> mpiexec -hosts 1 COMPUTE 1 hostname -pwd
ERROR: Failed RpcCliCreateContext error 1722
Aborting: mpiexec on SERVER is unable to connect to the smpd service on COMPUTE:8677
Other MPI error, error stack:
connect failed - The RPC server is unavailable. (errno 1722)
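Error 1722 means the RPC endpoint could not be reached at all, so one quick check is whether a plain TCP connection to the smpd port succeeds. The following is a minimal sketch in Python, assuming Python is available on the machine; `port_open` is a hypothetical helper, and the host name and port 8677 are taken from the output above. A successful connect only shows that the port is reachable, not that RPC authentication will work.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run from SERVER, `port_open("COMPUTE", 8677)` would indicate whether the smpd service port on COMPUTE accepts connections.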
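More frustrating is that only sometimes do I get prompted for a password. It suggests SERVER\Maarten as the user for COMPUTE, which is the account I am already logged in with on SERVER and which should not exist on COMPUTE (shouldn't it be COMPUTE\Maarten then?). Nonetheless it also fails: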
SERVER>mpiexec -hosts 1 COMPUTE 1 hostname.exe -pwd
Enter Password for SERVER\Maarten:
Save Credentials[y|n]? n
ERROR: Failed to connect to SMPD Manager Instance error 1726
Aborting: mpiexec on SERVER is unable to connect to the
smpd manager on COMPUTE:50915 error 1726
2: COMPUTE with MSMPILaunchSvc -> SERVER with MSMPILaunchSvc
COMPUTE> mpiexec -hosts 1 SERVER 1 hostname -pwd
ERROR: Failed RpcCliCreateContext error 5
Aborting: mpiexec on COMPUTE is unable to connect to the smpd service on SERVER:8677
Other MPI error, error stack:
connect failed - Access is denied. (errno 5)
3: COMPUTE with MSMPILaunchSvc -> SERVER with smpd daemon
Aborting: mpiexec on COMPUTE is unable to connect to the smpd service on SERVER:8677
Other MPI error, error stack:
connect failed - Access is denied. (errno 5)
4: SERVER with MSMPILaunchSvc -> COMPUTE with smpd daemon
ERROR: Failed to connect to SMPD Manager Instance error 1726
Aborting: mpiexec on SERVER is unable to connect to the smpd manager on
COMPUTE:51022 error 1726
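The numeric codes in the four cases above are standard Win32 system error codes. As a reading aid, here is a small lookup sketch in Python; the table only covers the three codes that appear in this post, and `describe` is a hypothetical helper name.

```python
# Win32 system error codes that appear in the mpiexec output above.
WIN32_ERRORS = {
    5: ("ERROR_ACCESS_DENIED", "Access is denied."),
    1722: ("RPC_S_SERVER_UNAVAILABLE", "The RPC server is unavailable."),
    1726: ("RPC_S_CALL_FAILED", "The remote procedure call failed."),
}


def describe(code: int) -> str:
    """Return a one-line description for a code from the table above."""
    name, text = WIN32_ERRORS.get(code, ("UNKNOWN", "not in this table"))
    return f"{code}: {name} ({text})"


for code in (1722, 1726, 5):
    print(describe(code))
```

Read this way, 1722 points at connectivity (nothing answered on the port), 5 at credentials or permissions, and 1726 at a call that started but was dropped partway.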
Update:
Trying with the smpd daemon on both nodes, I get this error:
[-1:9796] Authentication completed. Successfully obtained Context for Client.
[-1:9796] version check complete, using PMP version 3.
[-1:9796] create manager process (using smpd daemon credentials)
[-1:9796] smpd reading the port string from the manager
[-1:9848] Launching smpd manager instance.
[-1:9848] created set for manager listener, 376
[-1:9848] smpd manager listening on port 51149
[-1:9796] closing the pipe to the manager
[-1:9848] Authentication completed. Successfully obtained Context for Client.
[-1:9848] Authorization completed.
[-1:9848] version check complete, using PMP version 3.
[-1:9848] Received session header from parent id=1, parent=0, level=0
[01:9848] Connecting back to parent using host SERVER and endpoint 17979
[01:9848] Previous attempt failed with error 5, trying to authenticate without Kerberos
[01:9848] Failed to connect back to parent error 5.
[01:9848] ERROR: Failed to connect back to parent 'ncacn_ip_tcp:SERVER:17979' error 5
[01:9848] smpd manager successfully stopped listening.
[01:9848] SMPD exiting with error code 4294967293.
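The exit code 4294967293 in the last line is easier to read as a signed value: it is the unsigned 32-bit representation of -3. A short Python sketch of the conversion follows; `to_signed32` is a hypothetical helper, and what -3 means inside smpd is not stated in the log.

```python
def to_signed32(code: int) -> int:
    """Reinterpret an unsigned 32-bit exit code as a signed 32-bit integer."""
    code &= 0xFFFFFFFF  # keep only the low 32 bits
    # If the sign bit is set, subtract 2**32 to get the negative value.
    return code - 0x100000000 if code & 0x80000000 else code


print(to_signed32(4294967293))
```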
On the host:
[-1:12264] Launching SMPD service.
[-1:12264] smpd listening on port 8677
[-1:12264] Authentication completed. Successfully obtained Context for Client.
[-1:12264] version check complete, using PMP version 3.
[-1:12264] create manager process (using smpd daemon credentials)
[-1:12264] smpd reading the port string from the manager
[-1:16668] Launching smpd manager instance.
[-1:16668] created set for manager listener, 364
[-1:16668] smpd manager listening on port 18033
[-1:12264] closing the pipe to the manager
[-1:16668] Authentication completed. Successfully obtained Context for Client.
[-1:16668] Authorization completed.
[-1:16668] version check complete, using PMP version 3.
[-1:16668] Received session header from parent id=1, parent=0, level=0
[01:16668] Connecting back to parent using host SERVER and endpoint 18031
[01:16668] Authentication completed. Successfully obtained Context for Client.
[01:16668] Authorization completed.
[01:16668] handling command SMPD_CONNECT src=0
[01:16668] now connecting to COMPUTE
[01:16668] 1 -> 2 : returning SMPD_CONTEXT_LEFT_CHILD
[01:16668] using spn msmpi/COMPUTE to contact server
[01:16668] SERVER posting a re-connect to COMPUTE:51161 in left child context.
[01:16668] ERROR: Failed to connect to SMPD Manager Instance error 1726
[01:16668] sending abort command to parent context.
[01:16668] posting command SMPD_ABORT to parent, src=1, dest=0.
[01:16668] ERROR: smpd running on SERVER is unable to connect to smpd service on COMPUTE:8677
[01:16668] Handling cmd=SMPD_ABORT result
[01:16668] cmd=SMPD_ABORT result will be handled locally
[01:16668] parent terminated unexpectedly - initiating cleaning up.
[01:16668] no child processes to kill - exiting with error code -1
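With debug output this long, it can help to pull out only the ERROR lines. The sketch below is a minimal Python filter, assuming each log line has the form `[id:pid] message` as in the output above; reading the first bracketed field as an smpd node id and the second as a process id is an assumption, and `error_lines` is a hypothetical helper.

```python
import re

# Matches lines such as "[01:9848] ERROR: some message".
LINE_RE = re.compile(r"^\[(-?\d+):(\d+)\]\s+ERROR:\s*(.*)$")


def error_lines(log_text: str):
    """Return (node_id, pid, message) for every ERROR line in the log text."""
    results = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            node, pid, message = match.groups()
            results.append((node, int(pid), message))
    return results
```

Applied to the two logs above, this would leave the failed connect back to `ncacn_ip_tcp:SERVER:17979` (error 5) on one side and the failed connect to the SMPD manager instance (error 1726) on the other.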
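After some trial and error I found that these and other non-specific errors show up when trying to run MS MPI across differing configurations (in my case a mix of HPC Cluster 2008 and HPC Cluster 2012 with MSMPI).
The solution was to downgrade all nodes to Windows Server 2008 R2 with HPC Cluster 2008. Because I don't use AD, I had to fall back to using the SMPD daemon and add firewall rules for it (skipping the cluster management tool).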