tbb::paralllel_for 数组的非递归拆分迭代

Question

我正在尝试使用英特尔的 TBB 对一维数组 A[] 执行计算。问题是，默认情况下，像 tbb::parallel_for 这样的算法会递归地将数组切成两半，将每个块发送到任务池以供线程窃取。

但是，我希望所有线程以线性方式 "scan" 数组。例如，使用 4 个线程让它们以任意顺序并行计算第一个 A[0], A[1], A[2] 和 A[3]。然后，以任意顺序计算集合 A[4], A[5], A[6] 和 A[7]。

现在，parallel_for，经过几次递归拆分后，将首先分别计算 A[0], A[2], A[4] 和 A[6]。然后，A[1], A[3], A[5] 和 A[7]（或类似的东西）。

我正在使用 C++14 和英特尔的 Threading Building Blocks。 parallel_reduce 或 parallel_scan 等算法以类似的方式运行，关于迭代的拆分 space，因此它们没有太大帮助。

我的猜测是我确实定义了自己的迭代 space 对象，但我不知道具体是如何定义的。 docs给出这样的定义：

class R {
    // True if range is empty
    bool empty() const;
    // True if range can be split into non-empty subranges
    bool is_divisible() const;
    // Splits r into subranges r and *this
    R( R& r, split );
    // Splits r into subranges r and *this in proportion p
    R( R& r, proportional_split p );
    // Allows usage of proportional splitting constructor
    static const bool is_splittable_in_proportion = true;
    ...
};

这一切都归结为这段代码：

#include <mutex>
#include <iostream>
#include <thread>

#include <tbb/parallel_for.h>
#include <tbb/task_scheduler_init.h>

std::mutex cout_mutex;

int main()
{
    auto N = 8;

    tbb::task_scheduler_init init(4);

    tbb::parallel_for(tbb::blocked_range<int>(0, N),
        [&](const tbb::blocked_range<int>& r)
        {
            for (int j = r.begin(); j < r.end(); ++j) {
                // Compute A[j]
                std::this_thread::sleep_for(std::chrono::seconds(1));
                cout_mutex.lock();
                std::cout << std::this_thread::get_id()<< ", " << j << std::endl;
                cout_mutex.unlock();
            }
        }
    );
}

以上代码给出：

140455557347136, 0
140455526110976, 4
140455521912576, 2
140455530309376, 6
140455526110976, 5
140455557347136, 1
140455521912576, 3
140455530309376, 7

但我想要这样的东西：

140455557347136, 0
140455526110976, 1
140455521912576, 2
140455530309376, 3
140455526110976, 5
140455557347136, 4
140455521912576, 6
140455530309376, 7

对迭代对象有什么建议或者有更好的解决方案吗？

Answer 1

考虑使用外部原子，例如（// !!! 标记更改行）

#include <mutex>
#include <iostream>
#include <thread>

#include <tbb/parallel_for.h>
#include <tbb/task_scheduler_init.h>

#include <atomic>                                 // !!!

std::mutex cout_mutex;

int main()
{
    auto N = 8;

    tbb::task_scheduler_init init(4);

    std::atomic<int> monotonic_begin{0};           // !!!

    tbb::parallel_for(tbb::blocked_range<int>(0, N),
        [&](const tbb::blocked_range<int>& r)
        {
            int s = static_cast<int>(r.size());    // !!!
            int b = monotonic_begin.fetch_add(s);  // !!!
            int e = b + s;                         // !!!
            for (int j = b; j < e; ++j) {          // !!!       
                // Compute A[j]
                std::this_thread::sleep_for(std::chrono::seconds(1));
                cout_mutex.lock();
                std::cout << std::this_thread::get_id() << ", " << j << std::endl;
                cout_mutex.unlock();
            }
        }
    );
}

该方法给出：

为什么单调行为很重要？您可能需要考虑 parallel_pipeline 或流程图来指定计算依赖关系。

tbb::paralllel_for 数组的非递归拆分迭代

Non-recursive split iteration of array with tbb::paralllel_for

c++

parallel-processing

multithreading

tbb