Running MPI programs with sudo permission on a cluster
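I'm working on a small Raspberry Pi cluster. My host program creates IP packet fragments and sends them to multiple relay programs. The relays receive the fragments and forward them to the destination using raw sockets. Because of the raw sockets, my relay programs must run with sudo permission. My setup involves an RPi 3 B v2 and an RPi 2 B v1. SSH is already set up and the nodes can log in to each other over SSH without a password, although I have to run ssh-agent and ssh-add my key on each node. I've managed to run a program that sends the rank from one node to another (2 different RPis). I run the MPI programs in MPMD fashion: since I only have 2 RPis, I run the host and one relay on node #1 and another relay on node #2. The host program takes the path of the file to send as a command-line argument.
If I run: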
mpirun --oversubscribe -n 1 --host localhost /home/pi/Desktop/host /some.jpeg : -n 2 --host localhost,rpi2 /home/pi/Desktop/relay
It runs, but obviously the program fails, because the relays can't open a raw socket without sudo permission.
If I run:
mpirun --oversubscribe -n 1 --host localhost /home/pi/Desktop/host /some.jpeg : -n 2 --host localhost,rpi2 sudo /home/pi/Desktop/relay
The relays report a world size of 1 and the host program hangs.
If I run:
mpirun --oversubscribe -n 1 --host localhost sudo /home/pi/Desktop/host /some.jpeg : -n 2 --host localhost,rpi2 sudo /home/pi/Desktop/relay
All relays and the host report a world size of 1.
I found a similar question here: OpenMPI / mpirun or mpiexec with sudo permission
Following the short answer there, I ran:
mpirun --oversubscribe -n 1 --host localhost /home/pi/Desktop/host /some.jpeg : -n 2 --host localhost,rpi2 sudo -E /home/pi/Desktop/relay
This results in:
[raspberrypi:00979] OPAL ERROR: Unreachable in file ext2x_client.c at line 109
[raspberrypi:00980] OPAL ERROR: Unreachable in file ext2x_client.c at line 109
*** An error occurred in MPI_Init
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[raspberrypi:00979] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[raspberrypi:00980] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[32582,1],1]
Exit code: 1
--------------------------------------------------------------------------
I have run sudo visudo, and the file on both nodes looks like this:
# User privilege specification
root ALL=(ALL:ALL) ALL
pi ALL = NOPASSWD:SETENV: /etc/alternatives/mpirun
pi ALL=NOPASSWD:SETENV: /usr/bin/orterun
pi ALL=NOPASSWD:SETENV: /usr/bin/mpirun
When I run everything on a single node, it just works:
sudo mpirun --allow-run-as-root --oversubscribe -n 1 --host localhost /home/pi/Desktop/host /some.jpeg : -n 2 --host localhost,localhost /home/pi/Desktop/relay
// host
#include <mpi.h>

#include <filesystem>
#include <iostream>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    // Total number of ranks in MPI_COMM_WORLD.
    int world_size = []() {
        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        return size;
    }();
    // Rank of this process (not used further in this reduced example).
    int id = []() {
        int id;
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        return id;
    }();
    // The only argument is the path of the file to send.
    if (argc != 2) {
        std::cerr << "Filepath not passed\n";
        MPI_Finalize();
        return 0;
    }
    const std::filesystem::path filepath(argv[1]);
    if (not std::filesystem::exists(filepath)) {
        std::cerr << "File doesn't exist\n";
        MPI_Finalize();
        return 0;
    }
    std::cout << "World size: " << world_size << '\n';
    MPI_Finalize();
    return 0;
}
// relay
#include <mpi.h>

#include <iostream>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    // Total number of ranks in MPI_COMM_WORLD.
    int world_size = []() {
        int size;
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        return size;
    }();
    // Rank of this process (not used further in this reduced example).
    int id = []() {
        int id;
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        return id;
    }();
    std::cout << "World size: " << world_size << '\n';
    MPI_Finalize();
    return 0;
}
How can I configure the nodes so that they can run MPI programs with sudo?
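The simplest way to solve this is to set capabilities on the file. It still poses security issues, but they are not as severe as making the program setuid root. To give the program the capabilities that allow it to open raw sockets, run the following as root on every node where the relay runs: setcap cap_net_raw,cap_net_admin+eip program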