MPI segfaults when passing simple struct to function

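I am building a Monte Carlo simulation in C using MPI, and I am running into a strange error when using a struct to read a file. I have reproduced the problem in the simple code below. This sample code fails in the same way as the larger simulation. Below are the contents of main.c. readme.txt contains just a short line of text.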

#include <mpi.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct test_struct {
        char * filename;
} test_struct;


int read(struct test_struct * obj){
        FILE * file = fopen(obj->filename, "r");
        char buf[512];
        if (file == NULL) return -1;
        else {
                fgets(buf, sizeof(buf), file);
                printf("%s\n", buf);
        }
        fclose(file);
        return 0;

}

int main() {

        MPI_Init(NULL, NULL);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        struct test_struct obj;

        obj.filename = (char *) malloc(256*sizeof(char));
        strcpy(obj.filename, "readme.txt");
        printf("%s\n", obj.filename);
        read(&obj);

        free(obj.filename);

        return 0;
}

I compile with this simple command: mpicc -g main.c. When I run the executable, I get the following error message.

→ ./a.out
[lap-johnson:00190] *** Process received signal ***
[lap-johnson:00190] Signal: Segmentation fault (11)
[lap-johnson:00190] Signal code: Address not mapped (1)
[lap-johnson:00190] Failing at address: 0x7
[lap-johnson:00190] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f2023741730]
[lap-johnson:00190] [ 1] ./a.out(read+0x19)[0x7f20238a41ae]
[lap-johnson:00190] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_singleton.so(+0x2e77)[0x7f20225e2e77]
[lap-johnson:00190] [ 3] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_init+0x29a)[0x7f20234aa11a]
[lap-johnson:00190] [ 4] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_mpi_init+0x252)[0x7f202379be62]
[lap-johnson:00190] [ 5] /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Init+0xa9)[0x7f20237ca1b9]
[lap-johnson:00190] [ 6] ./a.out(+0x1211)[0x7f20238a4211]
[lap-johnson:00190] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f202358409b]
[lap-johnson:00190] [ 8] ./a.out(+0x10da)[0x7f20238a40da]
[lap-johnson:00190] *** End of error message ***
[1]    190 segmentation fault (core dumped)  ./a.out

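I tried using gdb to see what the error is about. It reports that obj, the pointer to the test_struct instance, has the value 0x7 inside read. I assume the program segfaults because it tries to read from this invalid address. The gdb output is below.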

→ gdb ./a.out
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./a.out...done.
(gdb) run
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 257]

Program received signal SIGSEGV, Segmentation fault.
0x00000000080011be in read (obj=0x7) at main.c:11
11              FILE * file = fopen(obj->filename, "r");
(gdb) print obj
$1 = (struct test_struct *) 0x7
(gdb) print obj->filename
Cannot access memory at address 0x7

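Why does the read function see the struct at memory address 0x7? I may be doing something wrong (or non-standard) with the string handling, but I don't know how to fix it. Note that this compiles and runs perfectly with gcc (once I remove the MPI parts, of course).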

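I have heard that MPI does not like structs with pointers as members, but I thought that only applied to sending and receiving. Any help with this issue is appreciated. I am quite new to MPI.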

I am running Open MPI version 3.1.3 on Debian inside Windows Subsystem for Linux (4.4.0-19041-Microsoft). I have confirmed that the same problem occurs on my Debian Linux machine with a custom build of Open MPI version 2.1.1.

read() is a libc subroutine, and you should not redefine it. Instead, simply rename this function in your code.

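Open MPI calls read() expecting the libc version, but your subroutine gets invoked instead, hence the strange stack trace: frame [1] is your read, reached from inside MPI_Init rather than from main. The 0x7 that gdb shows for obj is most likely the file descriptor that Open MPI passed as the first argument of read().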
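As a minimal sketch of the rename suggested above (the name read_first_line, the extra buf/size parameters, and the static qualifier are my own choices, not something Open MPI requires), the helper could look like this. Declaring it static gives it internal linkage, so even if a name did collide with a library symbol it would not be picked up by calls made inside libc or Open MPI.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct test_struct {
        char *filename;
} test_struct;

/* Hypothetical replacement for the question's read(). The new name does not
 * collide with any libc symbol, and `static` keeps the function private to
 * this translation unit. */
static int read_first_line(const test_struct *obj, char *buf, size_t size)
{
        assert(obj != NULL && buf != NULL && size > 0);

        FILE *file = fopen(obj->filename, "r");
        if (file == NULL)
                return -1;

        /* fgets returns NULL on an empty file or a read error. */
        if (fgets(buf, (int)size, file) == NULL) {
                fclose(file);
                return -1;
        }
        fclose(file);

        /* Drop the trailing newline, if any. */
        buf[strcspn(buf, "\n")] = '\0';
        return 0;
}
```

In the question's main, the call would then become read_first_line(&obj, buf, sizeof buf) with a local char buf[512], followed by printf("%s\n", buf) if the return value is 0.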