Incomprehensible performance improvement with OpenMP even when num_threads(1)

Question

The following few lines of code

int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char *buff = (unsigned char *) malloc( numel );

unsigned char *pbuff = buff;
#pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
for (int i=0; i<nrows; i++)
{
    for (int j=0; j<ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}

take 11130 usecs to run on my i5-3230M when compiled with

g++ -o main main.cpp -std=c++0x -O3

that is, when the OpenMP pragmas are ignored.

On the other hand, they take only 1496 usecs when compiled with

g++ -o main main.cpp -std=c++0x -O3 -fopenmp

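This is more than 6 times faster, which is quite surprising considering that it runs on a 2-core machine. In fact, I have also tested it with num_threads(1), and the performance improvement is still significant (more than 3 times faster).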

Can anyone help me understand this behavior?

EDIT: following the suggestions, I am providing the full code:

#include <stdlib.h>
#include <iostream>

#include <chrono>
#include <cassert>


int nrows = 4096;
int ncols = 4096;
size_t numel = nrows * ncols;
unsigned char * buff;


void func()
{
    unsigned char *pbuff = buff;
    #pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
    for (int i=0; i<nrows; i++)
    {
        for (int j=0; j<ncols; j++)
        {
            *pbuff += 1;
            pbuff++;
        }
    }
}


int main()
{
    // alloc & initialization
    buff = (unsigned char *) malloc( numel );
    assert(buff != NULL);
    for(size_t k=0; k<numel; k++)
        buff[k] = 0;

    //
    std::chrono::high_resolution_clock::time_point begin;
    std::chrono::high_resolution_clock::time_point end;
    begin = std::chrono::high_resolution_clock::now();      
    //
    for(int k=0; k<100; k++)
        func();
    //
    end = std::chrono::high_resolution_clock::now();
    auto usec = std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count();
    std::cout << "func average running time: " << usec/100 << " usecs" << std::endl;

    return 0;
}

Answer 1

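The answer, as it turns out, is that firstprivate(pbuff, nrows, ncols) effectively declares pbuff, nrows and ncols as local variables within the scope of the for loop. That in turn means the compiler can treat nrows and ncols as constants - it cannot make the same assumption about global variables! The loop stores through an unsigned char pointer, which is allowed to alias any object, so without the local copies the compiler has to assume that every store might have changed the globals and reload them.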
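To make the "local variables" point concrete, here is a hand-written approximation of what a compiler does when it outlines the parallel region into its own function. This is only a sketch under assumptions: the names OmpData, outlined_body and run_outlined are hypothetical, the code a real compiler generates looks different, and the work-sharing between threads is left out. What matters is that the firstprivate values end up in locals whose address is never taken, so stores through pbuff cannot change them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the data block handed to the outlined
// parallel region (not the layout any real compiler uses).
struct OmpData {
    unsigned char *pbuff;
    int nrows;
    int ncols;
};

// Hypothetical outlined loop body, as executed by a single thread.
void outlined_body(const OmpData *data)
{
    // firstprivate: the thread gets its own initialized copies.
    unsigned char *pbuff = data->pbuff;
    const int nrows = data->nrows;   // local, address never taken
    const int ncols = data->ncols;   // local, address never taken

    for (int i = 0; i < nrows; i++)
    {
        for (int j = 0; j < ncols; j++)
        {
            *pbuff += 1;   // cannot modify the locals nrows/ncols
            pbuff++;
        }
    }
}

// Runs the outlined body once over a zeroed rows x cols buffer and
// returns the sum of all elements (rows * cols if every byte was hit once).
std::size_t run_outlined(int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    std::vector<unsigned char> buf(static_cast<std::size_t>(rows) * cols, 0);
    OmpData data{buf.data(), rows, cols};
    outlined_body(&data);

    std::size_t sum = 0;
    for (unsigned char v : buf)
        sum += v;
    return sum;
}
```

Calling run_outlined(8, 16) touches each of the 128 bytes exactly once, which is the same work the original func() does for a single thread.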

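Consequently, with -fopenmp, you end up with the huge speedup because you are not accessing a global variable on every iteration. (Plus, with a constant ncols value, the compiler gets to do a bit of loop unrolling.)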
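Why can't the compiler simply assume the globals stay put? A store through an unsigned char pointer may legally modify the bytes of any object, including a global int, so the optimizer cannot rule it out for *pbuff. A contrived sketch of that possibility (g_limit and clobber_limit are made-up names for illustration, not part of the question's code):

```cpp
#include <cassert>
#include <cstddef>

int g_limit = 4096;   // hypothetical mutable global, playing the role of nrows/ncols

// Zeroes every byte of g_limit through an unsigned char pointer and
// returns the resulting value. This is well-defined C++ because
// unsigned char is allowed to alias any object.
int clobber_limit()
{
    g_limit = 4096;
    unsigned char *p = reinterpret_cast<unsigned char *>(&g_limit);
    for (std::size_t k = 0; k < sizeof(g_limit); k++)
        p[k] = 0;
    assert(g_limit == 0);   // the byte stores really did change the global
    return g_limit;
}
```

Nothing in the question's loop actually does this, but the compiler only sees a pointer of a type that could, so it reloads the global bounds after each store.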

By changing

int nrows = 4096;
int ncols = 4096;

to

const int nrows = 4096;
const int ncols = 4096;

or by changing

for (int i=0; i<nrows; i++)
{
    for (int j=0; j<ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}

to

int _nrows = nrows;
int _ncols = ncols;
for (int i=0; i<_nrows; i++)
{
    for (int j=0; j<_ncols; j++)
    {
        *pbuff += 1;
        pbuff++;
    }
}

the anomalous speedup vanishes - the non-OpenMP code is now just as fast as the OpenMP code.

The moral of the story? Avoid accessing mutable global variables inside performance-critical loops.
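One more way to follow that advice, offered as a sketch rather than something measured here: pass the buffer and the bounds by value as function parameters, so they are locals from the start. The names increment_all and all_equal are hypothetical. This should give the optimizer the same freedom as the local copies, though it is worth timing with your own compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical variant of func(): the buffer and bounds arrive by value,
// so inside the loop they are locals that stores through pbuff cannot touch.
void increment_all(unsigned char *pbuff, int nrows, int ncols)
{
    assert(pbuff != nullptr || nrows == 0 || ncols == 0);
    for (int i = 0; i < nrows; i++)
    {
        for (int j = 0; j < ncols; j++)
        {
            *pbuff += 1;
            pbuff++;
        }
    }
}

// Helper for checking the result: true if every element equals `expected`.
bool all_equal(const std::vector<unsigned char> &buf, unsigned char expected)
{
    for (unsigned char v : buf)
        if (v != expected)
            return false;
    return true;
}
```

In the question's program, func() would then be called as increment_all(buff, nrows, ncols), reading the globals once per call instead of once per iteration.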

c++

openmp