Parallelize inner loops

Question

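I'm new to OpenMP, so I may well have this wrong. I also haven't managed to find information about it, but I'm sure I'm missing something obvious.

I have some nested loops that I want to parallelize in some way.

Here is a sequential version. Note that f(i) is a fairly large integer, roughly between 100 and 100,000.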

for (int a = 0; a < 10; a++)
{
    for (int b = 0; b < 10; b++)
    {
        for (int c = 0; c < f(a); c++)
        {
            for (int d = 0; d < f(b); d++)
            {
                if (comp(c, d))
                {
                    result[a][b]++;
                }
            }
        }
    }
}

I naively came up with this way of parallelizing the code.

#pragma omp parallel
{
    // Create a result_local array to avoid critical sections in the loop

    #pragma omp for collapse(2) schedule(guided) nowait
    for (int a = 0; a < 10; a++)
    {
        for (int b = 0; b < 10; b++)
        {
            for (int c = 0; c < f(a); c++)
            {
                for (int d = 0; d < f(b); d++)
                {
                    if (comp(c, d))
                    {
                        result_local[a][b]++;
                    } 
                }
            }
        }
    }

    // Add the result_local to result
}

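This is the part I'm not so sure about. If my understanding is correct, OpenMP will not parallelize the c and d loops, meaning each thread will execute a c loop in its entirety. Given that f(i) can return a relatively low number (like 100) or a relatively high number (like 100,000), some of the threads might get stuck with a lot more work than other threads, which is not ideal.

So the question is how I can parallelize the inner loops to share the work better. I can't change collapse(2) to collapse(4), because the c and d loops iterate up to a number that is a function of the a and b variables.

I came across something in my research that might be useful.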

#pragma omp parallel
{
    // Create a result_local array to avoid critical sections in the loop

    for (int a = 0; a < 10; a++)
    {
        for (int b = 0; b < 10; b++)
        {
            #pragma omp parallel for collapse(2) schedule(guided)
            for (int c = 0; c < f(a); c++)
            {
                for (int d = 0; d < f(b); d++)
                {
                    if (comp(c, d))
                    {
                        result_local[a][b]++;
                    } 
                }
            }
        }
    }

    // Add the result_local to result
}

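Admittedly, I don't know enough to know if this is helpful at all. What I saw indicates this might be parallelizing the c and d loops but leaving the a and b loops serial?

Any help is appreciated.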

Answer 1

For your attempt at parallelizing the inner loops to have a chance of working, you need to do something about the data race on result_local:

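If you have enough memory for each thread to have its own private copy of result_local, you can specify reduction(+: result_local[:10][:10]) in the pragma, but I haven't used it with multidimensional arrays yet. You may have to use a linear array and "lexical indexing" (idx = a * 10 + b). If result_local is dynamically allocated (on the heap), that may be the better way to handle it anyway (better than some std::vector<std::vector<int>>, because of cache locality).
If comp is computationally expensive enough, putting #pragma omp atomic update in front of result_local[a][b]++ may be the better option. This needs less memory. In your example with a * b == 100, memory is probably not an issue.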
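The first option can be sketched as follows, using a flattened array and an array-section reduction (available since OpenMP 4.5). This is a minimal sketch, not the asker's real code: f and comp are hypothetical stand-ins, and the name count_with_reduction is made up for illustration. Without OpenMP enabled the pragmas are ignored and the function simply runs sequentially.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the question's f and comp (assumed to be pure).
static int f(int i) { return 50 + 25 * i; }
static bool comp(int c, int d) { return (c + d) % 3 == 0; }

constexpr int N = 10;  // range of a and b, as in the question

std::vector<std::int64_t> count_with_reduction()
{
    std::vector<std::int64_t> result(N * N, 0);
    std::int64_t* res = result.data();  // the array section is expressed through a pointer
    assert(res != nullptr);

    // Each thread gets a zero-initialised private copy of res[0 .. N*N);
    // the copies are added into the original at the end of the region.
    #pragma omp parallel reduction(+ : res[:N * N])
    {
        for (int a = 0; a < N; a++)
        {
            const int c_lim = f(a);
            for (int b = 0; b < N; b++)
            {
                const int d_lim = f(b);
                const int idx = a * N + b;  // "lexical" index into the linear array

                // The c iterations are shared among the threads; nowait is safe
                // because every thread only touches its private copy.
                #pragma omp for schedule(static) nowait
                for (int c = 0; c < c_lim; c++)
                    for (int d = 0; d < d_lim; d++)
                        if (comp(c, d))
                            res[idx]++;
            }
        }
    }
    return result;
}
```

Calling count_with_reduction() returns the 100 counters in row-major order, so the value for a given (a, b) is at index a * 10 + b.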

Since the branch inside the innermost loop may hurt performance, you may want to check whether result_local[a][b] += comp(c, d); performs better, since additions are very cheap.
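To make the two variants concrete, here is a small sketch that counts the same thing with and without the branch. The comp shown is a hypothetical placeholder and the function names are invented for illustration; whether the branchless form is actually faster depends on the compiler and on comp, so it is worth measuring.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the question's comp.
static bool comp(int c, int d) { return ((c ^ d) & 1) != 0; }

// Conditional increment, as in the question.
std::int64_t count_branchy(int c_lim, int d_lim)
{
    assert(c_lim >= 0 && d_lim >= 0);
    std::int64_t sum = 0;
    for (int c = 0; c < c_lim; c++)
        for (int d = 0; d < d_lim; d++)
            if (comp(c, d))
                sum++;
    return sum;
}

// Branchless variant: the bool returned by comp converts to 0 or 1.
std::int64_t count_branchless(int c_lim, int d_lim)
{
    assert(c_lim >= 0 && d_lim >= 0);
    std::int64_t sum = 0;
    for (int c = 0; c < c_lim; c++)
        for (int d = 0; d < d_lim; d++)
            sum += comp(c, d);
    return sum;
}
```

Both functions return the same count for the same limits; only the shape of the inner loop body differs.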

Answer 2

omp will not parallelize the c and d loops meaning each thread will execute a c loop in its entirety.

That is correct.

some of the threads might get stuck with a lot more work than other threads

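You are right: load imbalance between threads is a performance problem in the first code. schedule(dynamic) can help with this, but beyond that there is not much you can do with this version.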
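As an illustration of that suggestion, here is a minimal sketch of the first version with schedule(dynamic). The functions f and comp are hypothetical placeholders with deliberately uneven sizes, and count_dynamic is a made-up name. Because each (a, b) pair is handled by exactly one thread and writes its own cell, no thread-local array is needed in this variant. There are still only 100 work items, so one very large pair can dominate the run time.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins: f yields uneven sizes (20, 40, 80, 160, 20, ...).
static int f(int i) { return 20 << (i % 4); }
static bool comp(int c, int d) { return c == d; }

std::vector<std::int64_t> count_dynamic()
{
    constexpr int N = 10;
    std::vector<std::int64_t> result(N * N, 0);

    // The 100 (a, b) pairs are handed out one at a time, so a thread that
    // finishes a cheap pair immediately picks up the next one.
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++)
        {
            const int c_lim = f(a);
            const int d_lim = f(b);
            assert(c_lim > 0 && d_lim > 0);

            std::int64_t local_sum = 0;
            for (int c = 0; c < c_lim; c++)
                for (int d = 0; d < d_lim; d++)
                    local_sum += comp(c, d);

            result[a * N + b] = local_sum;  // each pair owns one distinct cell
        }
    return result;
}
```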

I don't know enough to know if this helpful at all. What I saw indicates this might be parallelizing the c and d loops but leaving the a and b loops serial?

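Technically, the a and b loops are also executed by all threads (since they are inside the parallel region): every thread runs every iteration, and each omp parallel for it encounters opens a nested region that ends with an implicit barrier. You should not use the second omp parallel: depending on the runtime, it can create new threads 100 times, and even when no new threads are created it makes the code inefficient (for example because of bad default thread pinning). Moreover, schedule(guided) is not needed here and should be less efficient than schedule(static). So use omp for collapse(2) schedule(static) instead.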

how can I parallelize the inner loops to share the work better.

The last code is fairly good in terms of work balance, although it introduces some unneeded overhead:

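The implicit synchronization of omp for can be skipped with nowait, since all threads work on thread-private data.
Accesses to result_local[a][b] can be replaced with accesses to a fast thread-private local variable.
The conditional increment can be replaced with a branchless boolean increment.
f(a) and f(b) can be computed once ahead of the loops, although an optimizing compiler should already do this.
When f(a) * f(b) is very small, it is better not to run the loops in parallel (because communication between cores is expensive). However, this depends heavily on whether comp is expensive.
When f(a) is large, the expensive collapse(2) is not needed, since all threads have enough work (collapse(2) generally slows execution down because compilers usually generate slow modulo instructions to recover the loop iterator values at run time).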

Here is the resulting code with most of these fixes applied:

#pragma omp parallel
{
    // Create a result_local array to avoid critical sections in the loop

    // Arbitrary threshold (this may not be optimal)
    const int threshold = 4 * omp_get_num_threads();

    for (int a = 0; a < 10; a++)
    {
        const int c_lim = f(a);

        for (int b = 0; b < 10; b++)
        {
            const int d_lim = f(b);
            int64_t local_sum = 0;

            if(c_lim < threshold)
            {
                #pragma omp for collapse(2) schedule(static) nowait
                for (int c = 0; c < c_lim; c++)
                    for (int d = 0; d < d_lim; d++)
                        local_sum += comp(c, d);
            }
            else
            {
                #pragma omp for schedule(static) nowait
                for (int c = 0; c < c_lim; c++)
                    for (int d = 0; d < d_lim; d++)
                        local_sum += comp(c, d);
            }

            result_local[a][b] += local_sum;
        }
    }

    // Add the result_local to result
}

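Another, more effective strategy is to redesign the sequential algorithm so that it does significantly less work.

Redesigning the algorithm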

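Notice that comp(c, d) is recomputed many times with the same values (up to 100 times), and the same goes for result_local[a][b]++ and even f(b) (up to 1,000,000 times). In such cases, the general solution is to memoize the results, to avoid recomputing the expensive parts of the algorithm over and over.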

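Note that you cannot precompute all the needed comp(a, b) values (a and b here stand for the two arguments of comp, not the outer loop indices): that would be too expensive in memory (up to about 10 GiB). The trick is therefore to split the 2D space into tiles. Here is how the algorithm works: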

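Compute all f(a) and f(b) sequentially (one pair of limits for each of the 100 (a, b) combinations);
Split the iteration space into tiles of a reasonable size (e.g. 100x100) and work out in advance all the tiles that need to be computed in full (possibly in parallel, although this is tedious);
Compute, in parallel, the sum of comp(a, b) over each tile (i.e. for a in [a_tile_begin;a_tile_end[ and b in [b_tile_begin;b_tile_end[), with each thread working on several tiles, and write the sums into a shared array;
Use the tile sums to compute the final result in parallel (partial tiles are computed on the fly in this last step).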

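This algorithm is definitely much more complex, but it should be roughly 100 times faster than the one above, since most operations are computed only once.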
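The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not a tuned implementation: f and comp are hypothetical pure placeholders, the tile edge T is arbitrary, every complete tile of the largest rectangle is summed (rather than first selecting the needed ones), and the names count_tiled and count_naive are invented here. count_naive is the original quadruple loop for one pair, kept only as a reference for checking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the question's f and comp (both assumed pure).
static int f(int i) { return 60 + 37 * i; }
static bool comp(int c, int d) { return (c * 31 + d * 17) % 7 == 0; }

constexpr int N = 10;  // range of a and b
constexpr int T = 32;  // tile edge (arbitrary choice)

// Reference: the original loops for one (a, b) pair.
std::int64_t count_naive(int a, int b)
{
    std::int64_t sum = 0;
    for (int c = 0; c < f(a); c++)
        for (int d = 0; d < f(b); d++)
            sum += comp(c, d);
    return sum;
}

std::vector<std::int64_t> count_tiled()
{
    // Step 1: evaluate f once per index.
    std::vector<int> lim(N);
    for (int i = 0; i < N; i++) lim[i] = f(i);
    const int max_lim = *std::max_element(lim.begin(), lim.end());
    assert(max_lim >= 0);
    const int nt = max_lim / T;  // complete tiles per dimension

    // Steps 2 and 3: sum comp over every complete tile exactly once.
    std::vector<std::int64_t> tile_sum(nt * nt, 0);
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ti = 0; ti < nt; ti++)
        for (int tj = 0; tj < nt; tj++)
        {
            std::int64_t s = 0;
            for (int c = ti * T; c < (ti + 1) * T; c++)
                for (int d = tj * T; d < (tj + 1) * T; d++)
                    s += comp(c, d);
            tile_sum[ti * nt + tj] = s;
        }

    // Step 4: combine tile sums; partial tiles are evaluated on the fly.
    std::vector<std::int64_t> result(N * N, 0);
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int a = 0; a < N; a++)
        for (int b = 0; b < N; b++)
        {
            const int c_lim = lim[a], d_lim = lim[b];
            const int fc = c_lim / T, fd = d_lim / T;  // complete tiles inside the rectangle
            std::int64_t s = 0;
            for (int ti = 0; ti < fc; ti++)
                for (int tj = 0; tj < fd; tj++)
                    s += tile_sum[ti * nt + tj];
            for (int c = fc * T; c < c_lim; c++)   // leftover rows, full width
                for (int d = 0; d < d_lim; d++)
                    s += comp(c, d);
            for (int c = 0; c < fc * T; c++)       // leftover columns of the tiled rows
                for (int d = fd * T; d < d_lim; d++)
                    s += comp(c, d);
            result[a * N + b] = s;
        }
    return result;
}
```

In this sketch comp is evaluated once per cell for the tiled area, plus a thin border strip per (a, b) pair, instead of once per cell per pair. A larger T makes the tile table smaller but the border strips thicker, so the tile size is a trade-off to measure.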
