Calculating pi using the Monte Carlo method in two ways in C++ OpenMP

Which approach should be faster? The first one increments a single variable under a reduction:

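#pragma omp parallel private(seed, x, y, i) reduction(+:counter)
{
    // Give every thread its own seed so the random streams differ.
    seed = 25234 + 17 * omp_get_thread_num();
    // Each of the 8 threads draws prec/8 points. The loop already runs once
    // per thread, so no nested "parallel for" is needed here.
    for(i=0; i<prec/8; i++){
        x = (double)rand_r(&seed) / RAND_MAX;
        y = (double)rand_r(&seed) / RAND_MAX;
        if(x*x+y*y<1){
            counter++;  // private copy; the reduction sums the copies at the end
        }
    }
}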

The second one uses an array with one counter per thread; at the end, the sum of the array's elements is the result:

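#pragma omp parallel private(seed, x, y, i, nproc)
{
    // Give every thread its own seed so the random streams differ.
    seed = 25234 + 17 * omp_get_thread_num();
    nproc = omp_get_thread_num();  // this thread's slot in the counter array
    // Each of the 8 threads draws prec/8 points. The loop already runs once
    // per thread, so no nested "parallel for" is needed here.
    for(i=0; i<prec/8; i++){
        x = (double)rand_r(&seed) / RAND_MAX;
        y = (double)rand_r(&seed) / RAND_MAX;
        if(x*x+y*y<1){
            counter[nproc]++;  // shared array, one element per thread
        }
    }
}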

double time = omp_get_wtime() - start_time;
// Add up the per-thread counters.
int sum=0;
for(int i=0; i<8; i++){
    sum+=counter[i];
}

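In theory, the second way should be faster, because the threads do not share one variable; each thread has its own. But when I measure the execution time: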

first approach: 3.72423 [s]

second approach: 8.94479 [s]

Is my reasoning wrong, or is my code doing something wrong?

You are a victim of false sharing in the second approach. Here is an interesting article from Intel on the topic.

False sharing occurs when threads on different processors modify variables that reside on the same cache line. This invalidates the cache line and forces a memory update to maintain cache coherency.

If two processors operate on independent data in the same memory address region storable in a single line, the cache coherency mechanisms in the system may force the whole line across the bus or interconnect with every data write, forcing memory stalls in addition to wasting system bandwidth
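To make the quoted description concrete for the question's counter array, here is a minimal sketch that checks how many cache lines eight counters occupy. It assumes 64-byte cache lines and a 4-byte int, which is common on x86-64 but not guaranteed; kCacheLine, cache_line_of, lines_spanned, Unpadded, PaddedCounter and Padded are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines (typical on x86-64, not guaranteed elsewhere).
constexpr std::size_t kCacheLine = 64;

// Index of the cache line that contains address p, for the assumed line size.
inline std::uintptr_t cache_line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// Number of cache lines from the one holding a[0] to the one holding the
// first byte of a[n-1], inclusive.
template <typename T>
std::size_t lines_spanned(const T* a, std::size_t n) {
    assert(n > 0);
    return cache_line_of(&a[n - 1]) - cache_line_of(&a[0]) + 1;
}

// Layout like the question's array: 8 ints are 32 bytes with a 4-byte int.
// The array is aligned to a line boundary so the result is deterministic.
struct Unpadded {
    alignas(kCacheLine) int counter[8];
};

// Padded layout: alignas rounds each element up to a full cache line.
struct PaddedCounter {
    alignas(kCacheLine) int value;
};
struct Padded {
    PaddedCounter counter[8];
};
```

Under those assumptions, lines_spanned(u.counter, 8) on an Unpadded u gives 1, so all eight threads keep writing to the same line, while the same call on a Padded p gives 8, one line per thread.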

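Intuitively, I don't think the first approach should be slower.
With the reduction you do create a private copy on each thread and then apply the final result to one global variable. That behaves in some ways like your shared array, but the problem with the array is that you get false sharing even though your accesses are independent.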
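One possible way to keep the per-thread array and still avoid false sharing is to pad each slot to its own cache line. The following is a sketch under assumptions, not the asker's actual program: it assumes 64-byte cache lines and a POSIX system (for rand_r), and PaddedSlot, kThreads and estimate_pi_padded are hypothetical names. The fallback for omp_get_thread_num only exists so the sketch also builds serially without -fopenmp.

```cpp
#include <cassert>
#include <stdlib.h>  // rand_r (POSIX), RAND_MAX

#ifdef _OPENMP
#include <omp.h>
#else
// Fallback so the sketch still builds (and runs on one thread) without OpenMP.
static int omp_get_thread_num() { return 0; }
#endif

constexpr int kThreads = 8;  // thread count used in the question

// Assumption: 64-byte cache lines. alignas pads each slot to a full line,
// so two threads never write to the same line.
struct alignas(64) PaddedSlot {
    long hits = 0;     // points that fell inside the quarter circle
    long samples = 0;  // points drawn by this thread
};

double estimate_pi_padded(long samples_per_thread) {
    assert(samples_per_thread > 0);
    PaddedSlot counter[kThreads];

    #pragma omp parallel num_threads(kThreads)
    {
        const int tid = omp_get_thread_num();
        unsigned int seed = 25234 + 17 * tid;  // per-thread seed, as in the question
        for (long i = 0; i < samples_per_thread; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y < 1) {
                counter[tid].hits++;  // touches only this thread's cache line
            }
        }
        counter[tid].samples = samples_per_thread;
    }

    // Combine the per-thread slots. Slots of threads that never ran stay zero,
    // so the estimate stays correct if the runtime provides fewer threads.
    long hits = 0, samples = 0;
    for (int i = 0; i < kThreads; i++) {
        hits += counter[i].hits;
        samples += counter[i].samples;
    }
    return 4.0 * hits / samples;
}
```

Calling estimate_pi_padded(prec / 8) would correspond to the question's setup. Accumulating into a local variable and writing it to the array once after the loop should have a similar effect, and is roughly what the reduction clause in the first approach does.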