OpenMP parallelization not performing as expected

Question

My code looks like this:

void SimulationStep(float *In, float *Out, float L, int N)
{
    Out[0] = In[0] - 2.0f*L*In[0] + L*In[N-1] + L*In[1];
    for (int x = 1; x < N-1; x++)
    {
        Out[x] = In[x] - 2.0f*L*In[x] + L*In[x-1] + L*In[x+1];
    }
    Out[N-1] = In[N-1] - 2.0f*L*In[N-1] + L*In[N-2] + L*In[0];
}

I am trying to parallelize it. I have tried many things; here is one example:

void SimulationStep(float *In, float *Out, float L, int N)
{
    Out[0] = In[0] - 2.0f*L*In[0] + L*In[N-1] + L*In[1];
    #pragma omp parallel for
    for (int x = 1; x < N-1; x++)
    {
        Out[x] = In[x] - 2.0f*L*In[x] + L*In[x-1] + L*In[x+1];
    }
    Out[N-1] = In[N-1] - 2.0f*L*In[N-1] + L*In[N-2] + L*In[0];
}

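The change only gained 0.5 seconds, going from 14 s to 13.5 s, so I suspect the code is not really being parallelized. I thought it might be a memory bottleneck, but I don't know what to do. Thanks in advance. P.S.: I'm compiling with gcc/9.2.0 using -O3 -fopenmp.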

Answer 1

so I don't know what to do

You have to measure.

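I only added a simple for loop that fills an array, and that alone took about as long as the simulation step. I made two global arrays of 10 million floats each.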

Comparison:

$ cc simst.c  
$ time ./a.out 
Fill  39577us
Simu  42061us

real    0m0.089s
user    0m0.064s
sys     0m0.025s

With -O3:

$ cc simst.c -O3 
$ time ./a.out 
Fill  23295us
Simu  14735us

real    0m0.044s
user    0m0.017s
sys     0m0.028s

Now the fill takes relatively longer than the simulation step.

OpenMP:

$ cc simst.c -O3 -fopenmp
$ time ./a.out 
Fill  23044us
Simu   6345us

real    0m0.036s
user    0m0.031s
sys     0m0.035s

The parallelized loop is more than twice as fast, but the overall effect is small.

Adding an architecture-specific switch:

$ cc simst.c -O3 -march=skylake -fopenmp

$ time ./a.out 
Fill  16781us
Simu   6052us

real    0m0.030s
user    0m0.023s
sys     0m0.036s

So OpenMP makes the function 2-3 times faster, but this hardly shows in the overall time.

The timing is done like this:

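...
clock_gettime(clock_id, &ts1);
/* Include tv_sec: tv_nsec alone restarts at 0 every second */
dt.tv_sec  = ts1.tv_sec  - ts.tv_sec;
dt.tv_nsec = ts1.tv_nsec - ts.tv_nsec;
printf("Fill %6ldus\n", (long)dt.tv_sec * 1000000L + dt.tv_nsec / 1000);

SimulationStep(in,  out, 50.0, LEN);

clock_gettime(clock_id, &ts2);
dt.tv_sec  = ts2.tv_sec  - ts1.tv_sec;
dt.tv_nsec = ts2.tv_nsec - ts1.tv_nsec;
printf("Simu %6ldus\n", (long)dt.tv_sec * 1000000L + dt.tv_nsec / 1000);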

I have a code...

You clearly have more code than that, and that is where the rest of the 14 seconds goes.

Answer 2

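As @Black_Alistar pointed out, by Amdahl's Law the achievable speedup is limited by the fraction of the program that can be parallelized. It looks like your application spends most of its time outside the simulation loop.

When I see such a marginal improvement in execution time, my first thought is that your application is I/O-bound: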

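If you store intermediate steps of the simulation, your application is most likely I/O-bound rather than CPU-bound, so using any parallel algorithm makes little difference;
Slow reads or writes can come from processing one data item at a time;
When using unbuffered I/O, do not read or write in small chunks.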
