TBB lambda vs. self-written body object
I am trying to compare the performance of serial and parallel for (non-lambda and lambda) using the following code.
#include <iostream>
#include <chrono>
#include <ctime>
#include <fstream>
#include <stdlib.h>
#define MAX 10000000
#include "tbb/tbb.h"
#include "tbb/task_scheduler_init.h"
using namespace std;
using namespace tbb;

void squarecalc(int a)
{
    a *= a;
}

// Serial version: plain loop over all MAX elements
void serial_apply_square(int* a)
{
    for (int i = 0; i < MAX; i++)
        squarecalc(*(a + i));
}

// Hand-written body object for parallel_for
class apply_square
{
    int* my_a;
public:
    void operator()(const blocked_range<size_t>& r) const
    {
        int* a = my_a;
        for (size_t i = r.begin(); i != r.end(); ++i)
            squarecalc(a[i]);
    }
    apply_square(int* a) : my_a(a) {}
};

// Parallel version using the hand-written body object
void parallel_apply_square(int* a, size_t n)
{
    parallel_for(blocked_range<size_t>(0, n), apply_square(a));
}

// Parallel version using a lambda as the body
void parallel_apply_square_lambda(int* a, size_t n)
{
    parallel_for(blocked_range<size_t>(0, n), [=](const blocked_range<size_t>& r)
    {
        for (size_t i = r.begin(); i != r.end(); ++i)
            squarecalc(a[i]);
    }
    );
}

int main()
{
    std::chrono::time_point<std::chrono::system_clock> start, end;
    int i = 0;
    int* a = new int[MAX];

    // Read the input numbers from "newfile"
    fstream of;
    of.open("newfile", ios::in);
    while (i < MAX)
    {
        of >> a[i];
        i++;
    }

    start = std::chrono::system_clock::now();
    serial_apply_square(a);
    end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    cout << "\nTime for serial execution :" << elapsed_seconds.count() << endl;

    start = std::chrono::system_clock::now();
    parallel_apply_square(a, MAX);
    end = std::chrono::system_clock::now();
    elapsed_seconds = end - start;
    cout << "\nTime for parallel execution [without lambda] :" << elapsed_seconds.count() << endl;

    start = std::chrono::system_clock::now();
    parallel_apply_square_lambda(a, MAX);
    end = std::chrono::system_clock::now();
    elapsed_seconds = end - start;
    cout << "\nTime for parallel execution [with lambda] :" << elapsed_seconds.count() << endl;

    delete[] a;  // allocated with new[], so release with delete[] rather than free()
}
In short, it just computes the square of 10000000 numbers, both serially and in parallel. Below is the output from several executions of the object code.
**1st execution**
Time for serial execution :0.043183
Time for parallel execution [without lambda] :0.035238
Time for parallel execution [with lambda] :0.036719
**2nd execution**
Time for serial execution :0.043252
Time for parallel execution [without lambda] :0.035403
Time for parallel execution [with lambda] :0.036811
**3rd execution**
Time for serial execution :0.043241
Time for parallel execution [without lambda] :0.035355
Time for parallel execution [with lambda] :0.036558
**4th execution**
Time for serial execution :0.043216
Time for parallel execution [without lambda] :0.035491
Time for parallel execution [with lambda] :0.036697
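Given that the parallel execution time is lower than the serial execution time in all cases, I am curious why the lambda approach takes longer than the other parallel version, where the body object is self-written.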
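- Why does the lambda version always take more time?
- Is it because of the overhead of the compiler creating its own body object?
- If the answer to the above question is yes, is the lambda version inferior to the self-written version?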
Edit
Below are the results for the optimized code (level -O2)
**1st execution**
Time for serial execution :0
Time for parallel execution [without lambda] :0.00055
Time for parallel execution [with lambda] :1e-05
**2nd execution**
Time for serial execution :0
Time for parallel execution [without lambda] :0.000583
Time for parallel execution [with lambda] :9e-06
**3rd execution**
Time for serial execution :0
Time for parallel execution [without lambda] :0.000554
Time for parallel execution [with lambda] :9e-06
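Now the optimized code seems to show better results for the serial part, and the lambda part has improved.
Does this mean that parallel code performance always needs to be tested with optimized code?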
Does this mean that parallel code performance always need to be tested with optimized code?
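The performance of any code must be tested with optimized code. Do you want your code to run fast while debugging, or in real use of the software?
The main problem in your code is that your loops do no work (squarecalc, and most likely even serial_apply_square(int* a), are optimized away completely), and the measured times are too short to serve as an indicator of how the different constructs perform in real life.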
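To make the point about loops that do no work concrete, here is a minimal sketch of a comparison whose results cannot be discarded by the optimizer. It is not the asker's program: `Range`, `for_range`, `ApplySquare`, `BenchResult` and `compare_bodies` are hypothetical names introduced only for illustration, `for_range` is a serial stand-in for `parallel_for` so that the sketch builds without TBB, and the input is generated in memory instead of being read from a file. `squarecalc` takes its argument by reference here, and a checksum of the output is returned so the work stays observable.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for tbb::blocked_range<size_t> (assumption: keeps the sketch TBB-free).
struct Range {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Takes the element by reference, so the squared value is actually stored.
inline void squarecalc(std::uint32_t& x) { x *= x; }

// Hand-written body object, mirroring apply_square from the question.
class ApplySquare {
    std::uint32_t* my_a;
public:
    explicit ApplySquare(std::uint32_t* a) : my_a(a) {}
    void operator()(const Range& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i) squarecalc(my_a[i]);
    }
};

// Serial stand-in for parallel_for: invokes the body once over the whole range.
template <typename Body>
void for_range(const Range& r, const Body& body) { body(r); }

// Times a single call with a monotonic clock and returns seconds.
template <typename F>
double time_seconds(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Sums the output so that the stored squares are used by the caller.
std::uint64_t checksum(const std::vector<std::uint32_t>& v) {
    std::uint64_t s = 0;
    for (std::uint32_t x : v) s += x;
    return s;
}

struct BenchResult {
    double sec_object, sec_lambda;
    std::uint64_t sum_object, sum_lambda;
};

// Runs the same squaring loop through a body object and through a lambda.
BenchResult compare_bodies(std::size_t n) {
    std::vector<std::uint32_t> base(n);
    for (std::size_t i = 0; i < n; ++i)
        base[i] = static_cast<std::uint32_t>(i % 1000);  // squares stay below 2^32

    std::vector<std::uint32_t> a = base, b = base;
    std::uint32_t* pa = a.data();
    std::uint32_t* pb = b.data();

    double t_obj = time_seconds([&] { for_range(Range{0, n}, ApplySquare(pa)); });
    double t_lam = time_seconds([&] {
        for_range(Range{0, n}, [=](const Range& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) squarecalc(pb[i]);
        });
    });
    return BenchResult{t_obj, t_lam, checksum(a), checksum(b)};
}
```

Calling `compare_bodies(10000000)` from `main` and printing all four fields gives timings that reflect real stores to memory. With optimization enabled, the lambda and the body object would be expected to compile to very similar code, since a lambda is itself a compiler-generated function object; any remaining difference should be checked over repeated runs rather than read off a single measurement.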