Results of tbb::parallel_reduce and std::accumulate differ
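I am learning Intel's TBB library. When summing all values in a std::vector, the result of tbb::parallel_reduce differs from that of std::accumulate once the vector holds more than 16,777,220 elements (I see the error at 16,777,320 elements). Here is my minimal working example: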
#include <iostream>
#include <vector>
#include <numeric>
#include <limits>
#include <algorithm>   // std::fill
#include <functional>  // std::plus
#include "tbb/tbb.h"
int main(int argc, const char * argv[]) {
int count = std::numeric_limits<int>::max() * 0.0079 - 187800; // - 187900 works
std::vector<float> heights(count);
std::fill(heights.begin(), heights.end(), 1.0f);
float ssum = std::accumulate(heights.begin(), heights.end(), 0);
float psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<float>::iterator>(heights.begin(), heights.end()), 0,
[](tbb::blocked_range<std::vector<float>::iterator> const& range, float init) {
return std::accumulate(range.begin(), range.end(), init);
}, std::plus<float>()
);
std::cout << std::endl << " Heights serial sum: " << ssum << " parallel sum: " << psum;
return 0;
}
Output on my OS X 10.10.3 with Xcode 6.3.1 and TBB stable 4.3-20141023 (poured from Homebrew):
Heights serial sum: 1.67772e+07 parallel sum: 1.67773e+07
Why is that? Should I report an error to the TBB developers?
Additional testing, applying your answers:
correct value is: 1949700403
cause we add 1.0f to zero 1949700403 times
using (int) init values:
Runtime: 17.407 sec. Heights serial sum: 16777216.000, wrong
Runtime: 8.482 sec. Heights parallel sum: 131127368.000, wrong
using (float) init values:
Runtime: 12.594 sec. Heights serial sum: 16777216.000, wrong
Runtime: 5.044 sec. Heights parallel sum: 303073632.000, wrong
using (double) initial values:
Runtime: 13.671 sec. Heights serial sum: 1949700352.000, wrong
Runtime: 5.343 sec. Heights parallel sum: 263690016.000, wrong
using (double) initial values and tbb::parallel_deterministic_reduce:
Runtime: 13.463 sec. Heights serial sum: 1949700352.000, wrong
Runtime: 99.031 sec. Heights parallel sum: 1949700352.000, wrong >>> almost 10x slower !
Why do all the reduce calls produce a wrong sum? Is (double) not enough?
Here is my test code:
#include <iostream>
#include <vector>
#include <numeric>
#include <limits>
#include <sys/time.h>
#include <iomanip>
#include "tbb/tbb.h"
#include <cmath>
class StopWatch {
private:
double elapsedTime;
timeval startTime, endTime;
public:
StopWatch () : elapsedTime(0) {}
void startTimer() {
elapsedTime = 0;
gettimeofday(&startTime, 0);
}
void stopNprintTimer() {
gettimeofday(&endTime, 0);
elapsedTime = (endTime.tv_sec - startTime.tv_sec) * 1000.0; // compute sec to ms
elapsedTime += (endTime.tv_usec - startTime.tv_usec) / 1000.0; // compute us to ms and add
std::cout << " Runtime: " << std::right << std::setw(6) << elapsedTime / 1000 << " sec."; // show in sec
}
};
int main(int argc, const char * argv[]) {
StopWatch watch;
std::cout << std::fixed << std::setprecision(3) << "" << std::endl;
size_t count = std::numeric_limits<int>::max() * 0.9079;
std::vector<float> heights(count);
std::cout << " Vector size: " << count << std::endl;
std::fill(heights.begin(), heights.end(), 1.0f);
watch.startTimer();
float ssum = std::accumulate(heights.begin(), heights.end(), 0.0); // change type of initial value here
watch.stopNprintTimer();
std::cout << " Heights serial sum: " << std::right << std::setw(8) << ssum << std::endl;
watch.startTimer();
float psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<float>::iterator>(heights.begin(), heights.end()), 0.0, // change type of initial value here
[](tbb::blocked_range<std::vector<float>::iterator> const& range, float init) {
return std::accumulate(range.begin(), range.end(), init);
}, std::plus<float>()
);
watch.stopNprintTimer();
std::cout << " Heights parallel sum: " << std::right << std::setw(8) << psum << std::endl;
return 0;
}
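Answer to my last question: they all produce wrong results because floats are not made for integer addition with large numbers. Switching to int solved it: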
[...]
std::vector<int> heights(count);
std::cout << " Vector size: " << count << std::endl;
std::fill(heights.begin(), heights.end(), 1);
watch.startTimer();
int ssum = std::accumulate(heights.begin(), heights.end(), (int)0);
watch.stopNprintTimer();
std::cout << " Heights serial sum: " << std::right << std::setw(8) << ssum << std::endl;
watch.startTimer();
int psum = tbb::parallel_reduce(tbb::blocked_range<std::vector<int>::iterator>(heights.begin(), heights.end()), (int)0,
[](tbb::blocked_range<std::vector<int>::iterator> const& range, int init) {
return std::accumulate(range.begin(), range.end(), init);
}, std::plus<int>()
);
watch.stopNprintTimer();
std::cout << " Heights parallel sum: " << std::right << std::setw(8) << psum << std::endl;
[...]
Results:
Vector size: 1949700403
Runtime: 13.041 sec. Heights serial sum: 1949700403, correct
Runtime: 4.728 sec. Heights parallel sum: 1949700403, correct and almost 4x faster
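Your call to std::accumulate is doing integer addition, then transforming the result to float at the end of the calculation. To accumulate floats, the accumulator should be a float*.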
float ssum = std::accumulate(heights.begin(), heights.end(), 0.0f);
^^^^
* Or any other type that can accumulate a float correctly.
This might solve this particular problem for you:
Your call to std::accumulate is doing integer addition, then transforming the result to float at the end of the calculation.
But floating-point addition is not an associative operation:
- With accumulate: (...((s+a1)+a2)+...)+an
- With parallel_reduce: any arrangement of the parentheses may occur.
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
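To add to the other correct answers for the 'why?' part: TBB offers parallel_deterministic_reduce, which guarantees reproducible results between two or more runs on the same data (but the result can still differ from std::accumulate). See the blog describing the problem and the deterministic algorithm.
So regarding the 'Should I report an error to TBB developers?' part, the answer is obviously no (unless you find something insufficient on the TBB side).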