C++ with OpenMP: trying to avoid false sharing in a tight loop over an array

Question

I am trying to introduce OpenMP into my C++ code to improve performance, starting with a simple test case like the one below:

#include <omp.h>
#include <chrono>
#include <iostream>
#include <cmath>

using std::cout;
using std::endl;

#define NUM 100000

int main()
{
    double data[NUM] __attribute__ ((aligned (128)));

    #ifdef _OPENMP
        auto t1 = omp_get_wtime();
    #else
        auto t1 = std::chrono::steady_clock::now();
    #endif

    for(long int k=0; k<100000; ++k)
    {

        #pragma omp parallel for schedule(static, 16) num_threads(4)
        for(long int i=0; i<NUM; ++i)
        {
            data[i] = cos(sin(i*i+ k*k));
        }
    }

    #ifdef _OPENMP
        auto t2 = omp_get_wtime();
        auto duration = t2 - t1;
        cout<<"OpenMP Elapsed time (second): "<<duration<<endl;
    #else
        auto t2 = std::chrono::steady_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
        cout<<"No OpenMP Elapsed time (second): "<<duration/1e6<<endl;
    #endif

    double tempsum = 0.;
    for(long int i=0; i<NUM; ++i)
    {
        // Index of the previous element (clamped to 0 for the first one).
        long int prevind = (i == 0 ? 0 : i-1);
        tempsum += i + sin(data[i]) + cos(data[prevind]);
    }
    cout<<"Raw data sum: "<<tempsum<<endl;
    return 0;
}

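The program accesses an array in a tight loop and modifies its elements, either in parallel or serially. (This post originally described an int array of size 10000; the code above uses a double array of NUM = 100000 elements.)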
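As a side check on the layout used above, the chunk size and alignment can be compared against the cache-line size at compile time. This is only a sketch: the 64-byte line size is an assumption (typical for x86-64, not stated in the post), and `cache_line_of` is a hypothetical helper introduced for illustration.

```cpp
#include <cstddef>

// Assumption: 64-byte cache lines (typical for x86-64, not stated in the post).
constexpr std::size_t kCacheLineBytes = 64;
// Values taken from the code above.
constexpr std::size_t kChunkElems = 16;   // schedule(static, 16)
constexpr std::size_t kAlignBytes = 128;  // aligned (128)

// Hypothetical helper: index of the cache line that element i of a double
// array falls into, counted from the cache-line-aligned start of the array.
constexpr std::size_t cache_line_of(std::size_t i)
{
    return (i * sizeof(double)) / kCacheLineBytes;
}

// With 8-byte doubles, each chunk spans 16 * 8 = 128 bytes,
// which is a whole number of 64-byte cache lines.
static_assert((kChunkElems * sizeof(double)) % kCacheLineBytes == 0,
              "chunk size is not a multiple of the cache-line size");
static_assert(kAlignBytes % kCacheLineBytes == 0,
              "array start is not cache-line aligned");

// The last element of one chunk and the first element of the next chunk
// fall on different cache lines.
static_assert(cache_line_of(kChunkElems - 1) != cache_line_of(kChunkElems),
              "adjacent chunks share a cache line");
```

Under these assumptions, no two chunks of the static schedule touch the same cache line, so different threads never write to the same line inside the parallel loop.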

Built with

g++ -o test test.cpp

or

g++ -o test test.cpp -fopenmp

The program reports the following results:

No OpenMP Elapsed time (second): 427.44
Raw data sum: 5.00009e+09

OpenMP Elapsed time (second): 113.017
Raw data sum: 5.00009e+09

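Environment: Intel 10th-generation CPU, Ubuntu 18.04, GCC 7.5, OpenMP 4.5.

~~I suspected that false sharing within a cache line was causing the poor performance of the OpenMP version.~~

I have updated the post with new test results after increasing the loop size; the OpenMP version now runs as fast as expected.

Thanks!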

Answer 1

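Since you are writing C++, use the C++ random number generators. Unlike the legacy C generator you are using, each thread can own its own engine, which makes them safe to use in parallel.
Also, you never use the data array, so the compiler is in fact free to remove the loop entirely.
Touch all the data once before running the timed loop. That way you ensure that the pages are instantiated and that the data is in cache or not, as the case may be.
Your loop is very short.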
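The warm-up and "use the result" points above can be sketched as follows. This is a minimal illustration, not the asker's code: `timed_fill` and its parameters are hypothetical names, and the kernel is the `cos(sin(...))` expression from the question.

```cpp
#include <chrono>
#include <cmath>

// Hypothetical helper: runs the question's kernel `repeats` times over
// data[0..n) and returns the elapsed wall-clock time in seconds.
// The sum of the final array contents is written to `checksum`.
double timed_fill(double* data, long long n, long long repeats, double& checksum)
{
    // Warm-up pass: touch every element once, outside the timed region,
    // so the pages are mapped before the clock starts.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        data[i] = 0.0;

    const auto t1 = std::chrono::steady_clock::now();
    for (long long k = 0; k < repeats; ++k)
    {
        #pragma omp parallel for schedule(static)
        for (long long i = 0; i < n; ++i)
            data[i] = std::cos(std::sin(static_cast<double>(i * i + k * k)));
    }
    const auto t2 = std::chrono::steady_clock::now();

    // Consume the results so the compiler cannot discard the timed loop.
    checksum = 0.0;
    for (long long i = 0; i < n; ++i)
        checksum += data[i];

    return std::chrono::duration<double>(t2 - t1).count();
}
```

Printing `checksum` alongside the elapsed time plays the same role as the "Raw data sum" line in the question. The code compiles with or without `-fopenmp`; without it the pragmas are ignored and the loops run serially.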

Answer 2

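rand() is not thread-safe (see here for details). Use an array of C++ random-number generators instead, one for each thread. See std::uniform_int_distribution.
You can remove the #ifdef _OPENMP variants from your code. In a Bash terminal, you can run your application as OMP_NUM_THREADS=1 ./test (a bare test would invoke the shell builtin instead of your program). See here for details.
You can therefore also remove num_threads(4), since the degree of parallelism can then be specified explicitly from the environment.
Use Google Benchmark or command-line arguments so that you can parameterize the number of threads and the array size.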
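A minimal sketch of the "one generator per thread" idea is shown below. The names `fill_random` and `base_seed`, the choice of std::mt19937, and the seeding scheme are all assumptions made for illustration; they are not taken from the question.

```cpp
#include <random>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: fills `out` with integers in [lo, hi],
// using one random engine per thread so that no engine is shared.
void fill_random(std::vector<int>& out, int lo, int hi, unsigned base_seed)
{
#ifdef _OPENMP
    const int max_threads = omp_get_max_threads();
#else
    const int max_threads = 1;
#endif
    // One engine per thread, each seeded differently (seeding scheme is an assumption).
    std::vector<std::mt19937> engines;
    for (int t = 0; t < max_threads; ++t)
        engines.emplace_back(base_seed + static_cast<unsigned>(t));

    const long long n = static_cast<long long>(out.size());
    #pragma omp parallel
    {
#ifdef _OPENMP
        const int tid = omp_get_thread_num();
#else
        const int tid = 0;
#endif
        std::mt19937& rng = engines[tid];
        // The distribution object is private to each thread as well.
        std::uniform_int_distribution<int> dist(lo, hi);

        #pragma omp for schedule(static)
        for (long long i = 0; i < n; ++i)
            out[i] = dist(rng);
    }
}
```

Because the thread count comes from omp_get_max_threads(), running the program with OMP_NUM_THREADS=1 ./test gives a single engine and a serial loop without changing the source.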

From there, I expect you will see that:

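Performance when you run OMP_NUM_THREADS=1 ./test is close to that of your non-OpenMP version.
An array of C++ RNGs is faster than calling rand() from multiple threads.
With an array of 10,000 elements, the multi-threaded version is still slower than the single-threaded version.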
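To run comparisons like these from a single binary, the thread count and array size can be taken from the command line. The sketch below is one possible layout: `BenchConfig`, `parse_args`, `run_once`, the argument order, and the default values are hypothetical, and the kernel is a single pass of the question's expression with k fixed to 0.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical settings for one benchmark run.
struct BenchConfig
{
    int threads;     // number of OpenMP threads to request
    long long size;  // number of array elements
};

// Hypothetical command line: ./test [threads] [size]
// Missing or non-positive arguments fall back to the defaults below.
BenchConfig parse_args(int argc, const char* const* argv)
{
    BenchConfig cfg{1, 10000};
    if (argc > 1)
    {
        const long t = std::strtol(argv[1], nullptr, 10);
        if (t > 0) cfg.threads = static_cast<int>(t);
    }
    if (argc > 2)
    {
        const long long s = std::strtoll(argv[2], nullptr, 10);
        if (s > 0) cfg.size = s;
    }
    return cfg;
}

// One pass of the kernel using cfg.threads threads.
// Returns the sum of the array so the work cannot be optimized away.
double run_once(const BenchConfig& cfg)
{
    std::vector<double> data(static_cast<std::size_t>(cfg.size));
    double sum = 0.0;

    #pragma omp parallel for schedule(static) num_threads(cfg.threads) reduction(+:sum)
    for (long long i = 0; i < cfg.size; ++i)
    {
        data[i] = std::cos(std::sin(static_cast<double>(i * i)));
        sum += data[i];
    }
    return sum;
}
```

A main function would call parse_args(argc, argv), time repeated calls to run_once, and print the result, so that ./test 1 10000 and ./test 4 10000 can be compared directly.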

c++

arrays

ubuntu

openmp

false-sharing