MPI_Wtime timer runs about 2 times faster in OpenMPI 2.0.2
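After updating OpenMPI from 1.8.4 to 2.0.2 I ran into erroneous time measurements with MPI_Wtime(). With version 1.8.4 the result matched the one returned by the omp_get_wtime() timer; now MPI_Wtime runs about 2 times faster.
What could cause this behaviour?
My sample code: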
#include <omp.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Busy work: repeatedly shifts an array so the timed region takes a few seconds. */
int some_work(int rank, int tid){
    int count = 10000;
    int arr[count];  /* variable-length array: accepted by g++ as an extension */
    for( int i=0; i<count; i++)
        arr[i] = i + tid + rank;
    for( int val=0; val<4000000; val++)
        for(int i=0; i<count-1; i++)
            arr[i] = arr[i+1];
    return arr[0];
}

int main (int argc, char *argv[]) {
    MPI_Init(NULL, NULL);
    int rank, size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("there are %d mpi processes\n", size);
    MPI_Barrier(MPI_COMM_WORLD);

    /* Start both timers right before the parallel region. */
    double omp_time1 = omp_get_wtime();
    double mpi_time1 = MPI_Wtime();
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        /* Only thread 0 does the work; the other threads just join the region. */
        if ( tid == 0 ) {
            int nthreads = omp_get_num_threads();
            printf("There are %d threads for process %d\n", nthreads, rank);
            int result = some_work(rank, tid);
            printf("result for process %d thread %d is %d\n", rank, tid, result);
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* Stop both timers and compare the two elapsed intervals. */
    double mpi_time2 = MPI_Wtime();
    double omp_time2 = omp_get_wtime();
    printf("process %d omp time: %f\n", rank, omp_time2 - omp_time1);
    printf("process %d mpi time: %f\n", rank, mpi_time2 - mpi_time1);
    printf("process %d ratio: %f\n", rank, (mpi_time2 - mpi_time1)/(omp_time2 - omp_time1) );
    MPI_Finalize();
    return EXIT_SUCCESS;
}
Compiling with
g++ -O3 src/example_main.cpp -o bin/example -fopenmp -I/usr/mpi/gcc/openmpi-2.0.2/include -L /usr/mpi/gcc/openmpi-2.0.2/lib -lmpi
and running with
salloc -N2 -n2 mpirun --map-by ppr:1:node:pe=16 bin/example
gives something like
there are 2 mpi processes
There are 16 threads for process 0
There are 16 threads for process 1
result for process 1 thread 0 is 10000
result for process 0 thread 0 is 9999
process 1 omp time: 5.066794
process 1 mpi time: 10.098752
process 1 ratio: 1.993125
process 0 omp time: 5.066816
process 0 mpi time: 8.772390
process 0 ratio: 1.731342
The ratio is not exactly 2, as I wrote at the beginning, but it is still large enough.
The results for OpenMPI 1.8.4 are fine:
g++ -O3 src/example_main.cpp -o bin/example -fopenmp -I/usr/mpi/gcc/openmpi-1.8.4/include -L /usr/mpi/gcc/openmpi-1.8.4/lib -lmpi -lmpi_cxx
gives
result for process 0 thread 0 is 9999
result for process 1 thread 0 is 10000
process 0 omp time: 4.655244
process 0 mpi time: 4.655232
process 0 ratio: 0.999997
process 1 omp time: 4.655335
process 1 mpi time: 4.655321
process 1 ratio: 0.999997
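One way to narrow down which of the two timers is off would be to compare each of them against a third, independent clock over a known interval. The sketch below is not part of the original test program: `reference_seconds` and `ratio_to_reference` are hypothetical helper names, and it assumes `std::chrono::steady_clock` is a trustworthy reference on the machine in question. A ratio close to 1 would suggest the timer agrees with the reference.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Reference wall clock in seconds, based on std::chrono::steady_clock.
// (Assumption: this clock is reliable on the node being tested.)
double reference_seconds() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}

// Ratio of the interval reported by `timer` to the interval reported by the
// reference clock while the calling thread sleeps for `ms` milliseconds.
// The reference readings bracket the timer readings.
double ratio_to_reference(double (*timer)(), int ms) {
    assert(ms > 0);
    const double ref1 = reference_seconds();
    const double t1 = timer();
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
    const double t2 = timer();
    const double ref2 = reference_seconds();
    return (t2 - t1) / (ref2 - ref1);
}
```

Inside the MPI program, after MPI_Init, something like `ratio_to_reference(MPI_Wtime, 1000)` and `ratio_to_reference(omp_get_wtime, 1000)` could be printed per rank to see which timer drifts.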
Maybe MPI_Wtime() is itself an expensive operation?
Do the results become more consistent if you avoid measuring the time consumed by MPI_Wtime() as part of the OpenMP time?
For example:
double mpi_time1 = MPI_Wtime();
double omp_time1 = omp_get_wtime();
/* do something */
double omp_time2 = omp_get_wtime();
double mpi_time2 = MPI_Wtime();
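To check directly whether the timer call itself is expensive, the average cost of one call can be estimated by invoking it many times between two readings of another clock. This is only a rough sketch: `steady_seconds` and `per_call_cost` are hypothetical helper names, and `std::chrono::steady_clock` stands in as the measuring clock so the snippet builds without MPI or OpenMP.

```cpp
#include <cassert>
#include <chrono>

// Wall-clock seconds from a monotonic clock, used here as the measuring clock.
double steady_seconds() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}

// Average cost in seconds of one call to `timer`, estimated over `calls` calls.
double per_call_cost(double (*timer)(), int calls) {
    assert(calls > 0);
    volatile double sink = 0.0;  // keeps the compiler from dropping the calls
    const double start = steady_seconds();
    for (int i = 0; i < calls; ++i)
        sink = timer();
    const double stop = steady_seconds();
    (void)sink;
    return (stop - start) / calls;
}
```

In the MPI program, `per_call_cost(MPI_Wtime, 1000000)` would give the figure for MPI_Wtime(). A per-call cost that is tiny compared with the roughly 5 second timed region would mean call overhead cannot account for a factor of 2.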
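I saw similar behaviour on my cluster (same OpenMPI version as yours, 2.0.2), and the problem was the default CPU frequency governor, 'conservative'.
Once I set the governor to 'performance', the output of MPI_Wtime() matched the correct timings (the output of 'time', in my case).
It appears that on some older Xeon processors (such as the Xeon E5620) some clock functions become skewed when an overly aggressive dynamic frequency scaling policy is used; the same OpenMPI version does not have this problem on newer Xeons in the same cluster.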
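To see which governor a node is currently using, the value can be read from sysfs. The sketch below assumes a Linux node that exposes the cpufreq interface under /sys/devices/system/cpu/; that path is an assumption about the cluster, and `governor_path` and `read_first_line` are hypothetical helper names.

```cpp
#include <fstream>
#include <string>

// Path of the cpufreq governor file for logical CPU `cpu`
// (assumes the Linux cpufreq sysfs layout).
std::string governor_path(int cpu) {
    return "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
           "/cpufreq/scaling_governor";
}

// Returns the first line of the file at `path`, or an empty string if the
// file cannot be opened or read.
std::string read_first_line(const std::string& path) {
    std::ifstream in(path);
    std::string line;
    if (in)
        std::getline(in, line);
    return line;
}
```

Calling `read_first_line(governor_path(0))` on each node would return a string such as 'conservative' or 'performance' when the interface is available, and an empty string otherwise.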