TBB lambda vs. self-written body object

Question

I am trying to compare the performance of a serial for loop and a parallel for (non-lambda and lambda versions) using the following code.

#include<iostream>
#include<chrono>
#include <ctime>
#include<fstream>
#include<stdlib.h>
#define MAX 10000000
#include "tbb/tbb.h"
#include "tbb/task_scheduler_init.h"

using namespace std;
using namespace tbb;
void squarecalc(int a)
{
    a *= a;
}
void serial_apply_square(int* a)
{
    for (int i = 0; i<MAX; i++)
        squarecalc(*(a + i));
}

class apply_square
{
    int* my_a;
public:
    void operator()(const blocked_range<size_t>& r) const
    {
        int *a = my_a;
        for (size_t i = r.begin(); i != r.end(); ++i)
            squarecalc(a[i]);
    }
    apply_square(int* a) :my_a(a){}
};
void parallel_apply_square(int* a, size_t n)
{
    parallel_for(blocked_range<size_t>(0, n), apply_square(a));
}
void parallel_apply_square_lambda(int* a, size_t n)
{
    parallel_for(blocked_range<size_t>(0, n), [=](const blocked_range<size_t>& r)
    {
        for (size_t i = r.begin(); i != r.end(); ++i)
            squarecalc(a[i]);
    }
    );
}

int main()
{
    std::chrono::time_point<std::chrono::system_clock> start, end;
    int i = 0;
    int* a = new int[MAX];

    fstream of;
    of.open("newfile", ios::in);
    while (i<MAX)
    {
        of >> a[i];
        i++;
    }

    start = std::chrono::system_clock::now();
    serial_apply_square(a);
    end = std::chrono::system_clock::now();

    std::chrono::duration<double> elapsed_seconds = end - start;
    cout << "\nTime for serial execution  :" << elapsed_seconds.count() << endl;

    start = std::chrono::system_clock::now();
    parallel_apply_square(a, MAX);
    end = std::chrono::system_clock::now();

    elapsed_seconds = end - start;
    cout << "\nTime for parallel execution [without lambda]  :" << elapsed_seconds.count() << endl;

    start = std::chrono::system_clock::now();
    parallel_apply_square_lambda(a, MAX);
    end = std::chrono::system_clock::now();

    elapsed_seconds = end - start;
    cout << "\nTime for parallel execution [with lambda] :" << elapsed_seconds.count() << endl;
    delete[] a; // allocated with new[], so it must be released with delete[], not free()
}

In short, it just computes the squares of 10000000 numbers, serially and in parallel. Below is the output from several runs of the compiled code.

**1st execution**

Time for serial execution  :0.043183

Time for parallel execution [without lambda]  :0.035238

Time for parallel execution [with lambda]  :0.036719

**2nd execution**

Time for serial execution  :0.043252

Time for parallel execution [without lambda]  :0.035403

Time for parallel execution [with lambda]  :0.036811

**3rd execution**

Time for serial execution  :0.043241

Time for parallel execution [without lambda]  :0.035355

Time for parallel execution [with lambda]  :0.036558

**4th execution**

Time for serial execution  :0.043216

Time for parallel execution [without lambda]  :0.035491

Time for parallel execution [with lambda]  :0.036697

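Although the parallel execution time is lower than the serial time in every case, I am curious why the lambda version takes longer than the other parallel version, where the body object is written by hand.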

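Why does the lambda version always take more time?
Is it because of the overhead of the compiler creating its own body object?
If the answer to the above question is yes, is the lambda version inferior to the self-written version?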

Edit

Below are the results for the optimized code (level -O2)

**1st execution**

Time for serial execution  :0

Time for parallel execution [without lambda]  :0.00055

Time for parallel execution [with lambda]  :1e-05

**2nd execution**

Time for serial execution  :0

Time for parallel execution [without lambda]  :0.000583

Time for parallel execution [with lambda]  :9e-06

**3rd execution**

Time for serial execution  :0

Time for parallel execution [without lambda]  :0.000554

Time for parallel execution [with lambda]  :9e-06

Now the optimized code seems to show better results for the serial part, and the time for the lambda part has improved as well.

Does this mean that parallel code performance always needs to be tested with optimized code?

Answer 1

Does this mean that parallel code performance always needs to be tested with optimized code?

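The performance of any code always needs to be tested with optimized code. Do you want your code to run fast while you are debugging it, or when the software is actually being used?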

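The main problem in your code is that your loops do no work: squarecalc takes its argument by value, so it (and very likely the whole of serial_apply_square(int* a)) is optimized away completely. On top of that, the measured times are far too short to serve as an indicator of how the different constructs perform in real life.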
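
To make the point above concrete, here is a minimal sketch of a measurement that does real work and cannot be discarded by the optimizer. It is not the code from the question and it deliberately leaves TBB out so that it stays self-contained. The names square_in_place, best_time_seconds and SquareBody are hypothetical helpers introduced only for illustration, and unsigned is used instead of int so that squaring large values wraps around instead of overflowing.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Squares the element in place. Taking the argument by reference gives the
// call an observable effect, unlike the by-value squarecalc in the question.
inline void square_in_place(unsigned& x) { x *= x; }

// Hypothetical helper: runs `work(pointer, size)` on a fresh copy of `input`
// `repeats` times and returns the shortest wall-clock time in seconds.
// A checksum of every result is added to `sink` so the results are used and
// the optimizer has no licence to throw the loop away.
template <typename Work>
double best_time_seconds(const std::vector<unsigned>& input, int repeats,
                         Work work, unsigned long long& sink)
{
    assert(repeats > 0);
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < repeats; ++r) {
        std::vector<unsigned> data = input;  // fresh copy, made outside the timed region
        auto start = std::chrono::steady_clock::now();
        work(data.data(), data.size());
        auto stop = std::chrono::steady_clock::now();
        sink += std::accumulate(data.begin(), data.end(), 0ULL);
        best = std::min(best, std::chrono::duration<double>(stop - start).count());
    }
    return best;
}

// Hand-written body object, playing the role of apply_square in the question.
struct SquareBody {
    void operator()(unsigned* a, std::size_t n) const
    {
        for (std::size_t i = 0; i < n; ++i)
            square_in_place(a[i]);
    }
};
```

A lambda with the same (unsigned*, std::size_t) parameters can be passed in place of SquareBody{}, and a wrapper around a parallel_for call could be timed the same way. A lambda expression is compiled into an unnamed class with an operator(), so in an optimized build it would normally be expected to cost about the same as a hand-written body object. Print or otherwise use sink after the measurements, and use an input large enough that each run takes well over a few milliseconds.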

c++

tbb