mpirun: Unrecognized argument mca

I have a C++ solver that I need to run in parallel with the following command:

nohup mpirun -np 16 ./my_exec > log.txt &

This command runs my_exec independently on the 16 processors available on my node. This used to work perfectly.

Last week the HPC department performed an OS upgrade, and now, when I launch the same command, I get two warning messages (one per processor). The first is:

--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              tamnun
  Registerable memory:     32768 MiB
  Total memory:            98294 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------

Then my code produces output telling me it thinks only 1 instance of the code was launched (Nprocs = 1 instead of 16):

# MPI IS ON; Nprocs = 1
Filename = ../input/odtParam.inp

# MPI IS ON; Nprocs = 1

***** Error, process 0 failed to create ../data/data_0/, or it was already there

Finally, the second warning message is:

--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          tamnun (PID 17446)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------

After looking online, I tried to follow the warning message's suggestion and set the MCA parameter mpi_warn_on_fork to 0 with the following command:

nohup mpirun --mca mpi_warn_on_fork 0 -np 16 ./my_exec > log.txt &

which produced the following error message:

[mpiexec@tamnun] match_arg (./utils/args/args.c:194): unrecognized argument mca
[mpiexec@tamnun] HYDU_parse_array (./utils/args/args.c:214): argument matching returned error
[mpiexec@tamnun] parse_args (./ui/mpich/utils.c:2964): error parsing input array
[mpiexec@tamnun] HYD_uii_mpx_get_parameters (./ui/mpich/utils.c:3238): unable to parse user arguments
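
For reference, one way to check which launcher is actually being picked up is to query the shell and the launcher itself (a minimal sketch; most MPI launchers accept a --version flag):

# Show every mpirun on the PATH; the first entry is the one that runs
type -a mpirun

# Ask the launcher to identify itself
mpirun --version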

I am using RedHat 6.7 (Santiago). I have contacted the HPC department, but since I am at a university it may take a day or two to get a reply. Any help or guidance would be greatly appreciated.

Edit in response to the answer:

Indeed, I compile my code with Open MPI's mpic++, but I was running the compiled executable with Intel's mpirun command, hence the error (after the OS upgrade, Intel's mpirun had been set as the default). I had to put the path to Open MPI's mpirun at the beginning of the $PATH environment variable.
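
The fix amounts to making Open MPI's bin directory come first in PATH. A minimal sketch of what I mean (the install path below is only a placeholder; the actual location is site-specific):

# Put Open MPI's launcher and compiler wrappers ahead of any other MPI on the PATH
# (/usr/local/openmpi/bin is a placeholder, not the real path on our cluster)
export PATH=/usr/local/openmpi/bin:$PATH

# Verify that the right binaries are now found first
type -a mpirun mpic++
mpirun --version    # should now report Open MPI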

The code now runs as expected, but I still get the first warning message above (the one that does not suggest using the MCA parameter mpi_warn_on_fork). I think (but am not sure) that this is an issue I need to resolve with the HPC department.
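
Following the FAQ item quoted in the first warning, the relevant knobs appear to be the locked-memory limit and, on Mellanox mlx4-based fabrics, the mlx4_core kernel module parameters. A read-only sketch of those checks, assuming the node uses the mlx4 driver:

# Locked-memory limit seen by MPI processes ("unlimited" is what the FAQ recommends)
ulimit -l

# mlx4_core parameters that bound how much memory can be registered;
# registerable memory is roughly 2^log_num_mtt * 2^log_mtts_per_seg * page_size
cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg

# Raising these requires root (e.g. options in /etc/modprobe.d/), so it is
# something the HPC administrators would have to change.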

[mpiexec@tamnun] match_arg (./utils/args/args.c:194): unrecognized argument mca
[mpiexec@tamnun] HYDU_parse_array (./utils/args/args.c:214): argument matching returned error
[mpiexec@tamnun] parse_args (./ui/mpich/utils.c:2964): error parsing input array
                                  ^^^^^
[mpiexec@tamnun] HYD_uii_mpx_get_parameters (./ui/mpich/utils.c:3238): unable to parse user arguments
                                                  ^^^^^

In the last case you are using MPICH. MPICH is not Open MPI, and its process launcher does not recognize the --mca argument, which is specific to Open MPI (MCA stands for Modular Component Architecture, the foundational framework on which Open MPI is built). A typical case of mixing several MPI implementations.
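
To confirm a mismatch like this, it helps to ask the compiler wrapper and the binary itself which implementation they belong to. A minimal sketch, using the executable name from the question:

# Open MPI's compiler wrapper understands --showme; MPICH and Intel MPI wrappers use -show instead
mpic++ --showme

# See which MPI library the executable is actually linked against
ldd ./my_exec | grep -i mpi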