An efficient way to perform an MPI all-reduce of one value based on another variable?

As an example, let's say I have

int a = ...;
int b = ...;
int c;

where a is the result of some complicated local calculation and b is some metric of the quality of a.

I'd like to send the best value of a to every process and store it in c, where best is defined by having the largest value of b.

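I guess I'm just wondering if there is a more efficient way of doing this than doing an allgather on a and b and then searching through the resulting arrays.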

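The actual code involves sending and comparing several hundred values on up to hundreds or thousands of processes, so any efficiency gains would be welcome.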

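You can pair the value of b with the rank of the process to find the rank that holds the maximum value of b. The MPI_DOUBLE_INT type is very useful for this purpose. You can then broadcast a from this rank so that every process has the value.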

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    // Create random a and b on each rank.
    srand(123 + my_rank);
    double a = rand() / (double)RAND_MAX;
    double b = rand() / (double)RAND_MAX;

    struct
    {
        double value;
        int rank;
    } s_in, s_out;

    s_in.value = b;
    s_in.rank = my_rank;

    printf("before: %d, %f, %f\n", my_rank, a, b);

    // Find the maximum value of b and the corresponding rank.
    MPI_Allreduce(&s_in, &s_out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);
    b = s_out.value;

    // Broadcast from the rank with the maximum value.
    MPI_Bcast(&a, 1, MPI_DOUBLE, s_out.rank, MPI_COMM_WORLD);

    printf("after: %d, %f, %f\n", my_rank, a, b);

    MPI_Finalize();
}

I guess I'm just wondering if there is a more efficient way of doing this than doing an allgather on a and b and then searching through the resulting arrays.

This can be achieved with just one MPI_Allreduce.

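I will present two approaches: a simpler one (suitable for your use case) and a more generic one for more complex use cases. The latter also serves to showcase MPI functionality such as custom MPI datatypes and custom MPI reduction operators.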

Approach 1

To represent

int a = ...;
int b = ...;

you can use the following struct:

typedef struct MyStruct {
    int b;
    int a;
} S;

Then you can use the MPI datatype MPI_2INT and the MPI operator MPI_MAXLOC:

The operator MPI_MINLOC is used to compute a global minimum and also an index attached to the minimum value. MPI_MAXLOC similarly computes a global maximum and index. One application of these is to compute a global minimum (maximum) and the rank of the process containing this value.

In your case, instead of the rank we will use the value of 'a'. Hence, the MPI_Allreduce call:

 S  local, global;
 ...
 MPI_Allreduce(&local, &global, 1, MPI_2INT, MPI_MAXLOC, MPI_COMM_WORLD);
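To see why placing 'a' in the index slot works, the combine rule that the MPI standard defines for MPI_MAXLOC can be emulated serially. The following is only a sketch with no MPI calls; Pair, maxloc_combine and maxloc_reduce are hypothetical names, and each array entry stands in for the pair contributed by one rank.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the struct S above: value first, "index" second. */
typedef struct { int b; int a; } Pair;

/* Combine rule of MPI_MAXLOC: keep the pair with the larger value;
   on equal values keep the pair with the smaller index. */
static Pair maxloc_combine(Pair x, Pair y) {
    if (x.b > y.b) return x;
    if (x.b < y.b) return y;
    return (x.a < y.a) ? x : y;
}

/* Fold the rule over one pair per emulated rank. */
static Pair maxloc_reduce(const Pair *pairs, size_t n) {
    assert(n >= 1); /* a reduction needs at least one contribution */
    Pair best = pairs[0];
    for (size_t i = 1; i < n; i++)
        best = maxloc_combine(best, pairs[i]);
    return best;
}
```

Calling maxloc_reduce on the pairs {3, 10}, {7, 42}, {7, 5}, {1, 99} gives b = 7 and a = 5. Note the tie-break: when several ranks share the largest b, the smallest a wins, which may or may not be the rule you want.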

The complete code looks like this:

#include <stdio.h>
#include <mpi.h>

typedef struct MyStruct {
    int b;
    int a;
} S;


int main(int argc,char *argv[]){
    MPI_Init(NULL,NULL); // Initialize the MPI environment
    int world_rank; 
    int world_size;
    MPI_Comm_rank(MPI_COMM_WORLD,&world_rank);
    MPI_Comm_size(MPI_COMM_WORLD,&world_size);
    
    // Some fake data
    S local, global;
    local.a = world_rank;
    local.b = world_size - world_rank;

    MPI_Allreduce(&local, &global, 1, MPI_2INT, MPI_MAXLOC, MPI_COMM_WORLD);
          
    if(world_rank == 0){
      printf("%d %d\n", global.b, global.a);
    }

    MPI_Finalize();
    return 0;
 }

Approach 2

MPI_MAXLOC only works for a certain number of predefined datatypes. Nonetheless, for the remaining cases you can use the following approach (based on this SO thread):

  1. Create a struct that will contain the values a and b;
  2. Create a custom MPI_Datatype representing the struct from step 1, to be sent across processes;
  3. Use MPI_Allreduce:

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

Combines values from all processes and distributes the result back to all processes

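  4. Use the operation MAX;

I'd like to send the best value of 'a' to every process and store it in 'c' where best is defined by having the largest value of 'b'.

  5. Then you have to tell MPI to only consider the element b of the struct when comparing. Hence, you need to create a custom MPI_Op max operation.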

Coding the approach

So let us break down the implementation above step by step:

First, define the struct:

typedef struct MyStruct {
    double a, b;
} S;

Second, create the custom MPI_Datatype:

void defineStruct(MPI_Datatype *tstype) {
    const int count = 2;
    int          blocklens[count];
    MPI_Datatype types[count];
    MPI_Aint     disps[count];

    for (int i=0; i < count; i++){
        types[i] = MPI_DOUBLE;
        blocklens[i] = 1;
    }
    disps[0] = offsetof(S,a);
    disps[1] = offsetof(S,b);

    MPI_Type_create_struct(count, blocklens, disps, types, tstype);
    MPI_Type_commit(tstype);
}

Very important: note that since we are using a struct, you have to keep in mind that (source)

the C standard allows arbitrary padding between the fields.

So reducing a struct with two doubles is NOT the same as reducing an array with two doubles.
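The padding point can be checked directly on your own platform. The sketch below prints where each field really starts and how large each struct is; Mixed is a hypothetical struct with fields of different sizes (added here only for contrast), and the helper names are assumptions as well. The actual numbers depend on the compiler and ABI, so none are claimed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct { double a, b; } S;                 /* struct used in this answer */
typedef struct { double value; int rank; } Mixed;  /* hypothetical mixed-type struct */

/* Bytes of sizeof(Mixed) that do not belong to either field. */
static size_t mixed_padding_bytes(void) {
    return sizeof(Mixed) - (sizeof(double) + sizeof(int));
}

/* Print the real field offsets, which are the values the displacements
   passed to MPI_Type_create_struct should match. */
static void print_layouts(void) {
    assert(offsetof(S, a) == 0); /* the first member always starts at offset 0 */
    printf("S:     a at %zu, b at %zu, sizeof = %zu\n",
           offsetof(S, a), offsetof(S, b), sizeof(S));
    printf("Mixed: value at %zu, rank at %zu, sizeof = %zu, padding = %zu\n",
           offsetof(Mixed, value), offsetof(Mixed, rank), sizeof(Mixed),
           mixed_padding_bytes());
}
```

If mixed_padding_bytes() is non-zero on your system, that is exactly the case where treating the struct as a flat array of its fields would read the wrong bytes.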

In main you have to do:

MPI_Datatype structtype;
defineStruct(&structtype);

Third, create the custom max operation:

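void max_struct(void *in, void *inout, int *len, MPI_Datatype *type){
    S *invals    = in;
    S *inoutvals = inout;
    for (int i=0; i < *len; i++) {
        // Copy the whole struct so that a travels with the winning b.
        // On equal b keep the smaller a, so the result does not depend on order.
        if (invals[i].b > inoutvals[i].b ||
            (invals[i].b == inoutvals[i].b && invals[i].a < inoutvals[i].a))
            inoutvals[i] = invals[i];
    }
}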

In main, do:

MPI_Op       maxstruct;
MPI_Op_create(max_struct, 1, &maxstruct);
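A reduction function like this can be sanity-checked without launching MPI by folding it over per-rank contributions by hand. The sketch below assumes a simplified signature (the MPI_Datatype argument is dropped so it compiles without mpi.h); carry_max and emulate_reduce are hypothetical names. It copies the whole element so that a follows the winning b, and it handles several elements per rank, as in the question's "several hundred values" case.

```c
#include <assert.h>

typedef struct { double a, b; } S;

/* Same in/inout/len contract as an MPI user function, minus the
   MPI_Datatype argument. The result must be written into inout. */
static void carry_max(void *in, void *inout, int *len) {
    S *invals = in;
    S *inoutvals = inout;
    for (int i = 0; i < *len; i++) {
        /* Larger b wins; on equal b the smaller a wins, so the
           outcome does not depend on the order of combination. */
        if (invals[i].b > inoutvals[i].b ||
            (invals[i].b == inoutvals[i].b && invals[i].a < inoutvals[i].a))
            inoutvals[i] = invals[i];
    }
}

/* Fold carry_max over nranks contributions of len elements each.
   contrib is laid out rank by rank; result receives len elements. */
static void emulate_reduce(S *contrib, int nranks, int len, S *result) {
    assert(nranks >= 1 && len >= 1);
    for (int i = 0; i < len; i++)
        result[i] = contrib[i];            /* start from rank 0 */
    for (int r = 1; r < nranks; r++)
        carry_max(&contrib[r * len], result, &len);
}
```

With three emulated ranks contributing two elements each, every output element ends up holding the a that belongs to the largest b for that position. The order-independent tie-break matters because the operator is registered as commutative (the second argument of MPI_Op_create is 1), which allows MPI to combine contributions in any order.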

Finally, call MPI_Allreduce:

S local, global;
...
MPI_Allreduce(&local, &global, 1, structtype, maxstruct, MPI_COMM_WORLD); 

The entire code put together:

#include <assert.h>
#include <stddef.h> // offsetof
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

typedef struct MyStruct {
    double a, b;
} S;

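void max_struct(void *in, void *inout, int *len, MPI_Datatype *type){
    S *invals    = in;
    S *inoutvals = inout;
    for (int i=0; i<*len; i++) {
        // Copy the whole struct so that a travels with the winning b.
        // On equal b keep the smaller a, so the result does not depend on order.
        if (invals[i].b > inoutvals[i].b ||
            (invals[i].b == inoutvals[i].b && invals[i].a < inoutvals[i].a))
            inoutvals[i] = invals[i];
    }
}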

void defineStruct(MPI_Datatype *tstype) {
    const int count = 2;
    int          blocklens[count];
    MPI_Datatype types[count];
    MPI_Aint     disps[count];

    for (int i=0; i < count; i++) {
        types[i] = MPI_DOUBLE;
        blocklens[i] = 1;
    }
    disps[0] = offsetof(S,a);
    disps[1] = offsetof(S,b);

    MPI_Type_create_struct(count, blocklens, disps, types, tstype);
    MPI_Type_commit(tstype);
}

int main(int argc,char *argv[]){
    MPI_Init(NULL,NULL); // Initialize the MPI environment
    int world_rank; 
    int world_size;
    MPI_Comm_rank(MPI_COMM_WORLD,&world_rank);
    MPI_Comm_size(MPI_COMM_WORLD,&world_size);
    MPI_Datatype structtype;
    MPI_Op       maxstruct;
    S  local, global;

    defineStruct(&structtype);
    MPI_Op_create(max_struct, 1, &maxstruct);

    // Just some random values
    local.a = world_rank;
    local.b = world_size - world_rank;

    MPI_Allreduce(&local, &global, 1, structtype, maxstruct, MPI_COMM_WORLD);  
          
    if(world_rank == 0){
      double c = global.a;
      printf("%f %f\n", global.b, c);
    }

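    // Release the custom operator and datatype before shutting MPI down.
    MPI_Op_free(&maxstruct);
    MPI_Type_free(&structtype);
    MPI_Finalize();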
    return 0;
 }