Intrinsics 与 Naive Vector 缩减结果的差异
Discrepancy in result of Intrinsics vs Naive Vector reduction
我一直在比较 Intrinsics vector reduction、naive vector reduction 和 using openmp pragmas vector reduction 的 运行 倍。但是,我发现在这些情况下结果有所不同。代码如下——(内在向量归约取自——Fastest way to do horizontal SSE vector sum (or other reduction))
#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <algorithm>
#include <immintrin.h>
inline float hsum_ps_sse3(__m128 v) {
__m128 shuf = _mm_movehdup_ps(v); // broadcast elements 3,1 to 2,0
__m128 sums = _mm_add_ps(v, shuf);
shuf = _mm_movehl_ps(shuf, sums); // high half -> low half
sums = _mm_add_ss(sums, shuf);
return _mm_cvtss_f32(sums);
}
float hsum256_ps_avx(__m256 v) {
__m128 vlow = _mm256_castps256_ps128(v);
__m128 vhigh = _mm256_extractf128_ps(v, 1); // high 128
vlow = _mm_add_ps(vlow, vhigh); // add the low 128
return hsum_ps_sse3(vlow); // and inline the sse3 version, which is optimal for AVX
// (no wasted instructions, and all of them are the 4B minimum)
}
void reduceVector_Naive(std::vector<float> values){
float result = 0;
for(int i=0; i<int(1e8); i++){
result += values.at(i);
}
printf("Reduction Naive = %f \n", result);
}
void reduceVector_openmp(std::vector<float> values){
float result = 0;
#pragma omp simd reduction(+: result)
for(int i=0; i<int(1e8); i++){
result += values.at(i);
}
printf("Reduction OpenMP = %f \n", result);
}
void reduceVector_intrinsics(std::vector<float> values){
float result = 0;
float* data_ptr = values.data();
for(int i=0; i<1e8; i+=8){
result += hsum256_ps_avx(_mm256_loadu_ps(data_ptr + i));
}
printf("Reduction Intrinsics = %f \n", result);
}
int main(){
std::vector<float> values;
for(int i=0; i<1e8; i++){
values.push_back(1);
}
reduceVector_Naive(values);
reduceVector_openmp(values);
reduceVector_intrinsics(values);
// The result should be 1e8 in each case
}
然而,我的输出如下-
Reduction Naive = 16777216.000000
Reduction OpenMP = 16777216.000000
Reduction Intrinsics = 100000000.000000
可以看出,只有Intrinsic functions计算正确,其他都面临精度问题。我完全意识到由于四舍五入而使用 float 可能会遇到的精度问题,所以我的问题是,为什么内在函数得到正确的答案,即使它实际上也是浮点值算术。
我将其编译为 - g++ -mavx2 -march=native -O3 -fopenmp main.cpp
尝试使用版本 7.5.0
以及 10.3.0
TIA
朴素循环按 1.0
添加,并在 16777216.000000
处停止添加,因为 binary32 浮点数中没有足够的有效数字。
查看此问答:Why does a float variable stop incrementing at 16777216 in C#?
当你添加计算的水平总和时,它会添加 8.0
,所以从什么时候开始停止添加的数字大约是 16777216*8
= 134217728
,你只是不在您的实验中达到它。
我一直在比较 Intrinsics vector reduction、naive vector reduction 和 using openmp pragmas vector reduction 的 运行 倍。但是,我发现在这些情况下结果有所不同。代码如下——(内在向量归约取自——Fastest way to do horizontal SSE vector sum (or other reduction))
#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <algorithm>
#include <immintrin.h>
inline float hsum_ps_sse3(__m128 v) {
__m128 shuf = _mm_movehdup_ps(v); // broadcast elements 3,1 to 2,0
__m128 sums = _mm_add_ps(v, shuf);
shuf = _mm_movehl_ps(shuf, sums); // high half -> low half
sums = _mm_add_ss(sums, shuf);
return _mm_cvtss_f32(sums);
}
float hsum256_ps_avx(__m256 v) {
__m128 vlow = _mm256_castps256_ps128(v);
__m128 vhigh = _mm256_extractf128_ps(v, 1); // high 128
vlow = _mm_add_ps(vlow, vhigh); // add the low 128
return hsum_ps_sse3(vlow); // and inline the sse3 version, which is optimal for AVX
// (no wasted instructions, and all of them are the 4B minimum)
}
void reduceVector_Naive(std::vector<float> values){
float result = 0;
for(int i=0; i<int(1e8); i++){
result += values.at(i);
}
printf("Reduction Naive = %f \n", result);
}
void reduceVector_openmp(std::vector<float> values){
float result = 0;
#pragma omp simd reduction(+: result)
for(int i=0; i<int(1e8); i++){
result += values.at(i);
}
printf("Reduction OpenMP = %f \n", result);
}
void reduceVector_intrinsics(std::vector<float> values){
float result = 0;
float* data_ptr = values.data();
for(int i=0; i<1e8; i+=8){
result += hsum256_ps_avx(_mm256_loadu_ps(data_ptr + i));
}
printf("Reduction Intrinsics = %f \n", result);
}
int main(){
std::vector<float> values;
for(int i=0; i<1e8; i++){
values.push_back(1);
}
reduceVector_Naive(values);
reduceVector_openmp(values);
reduceVector_intrinsics(values);
// The result should be 1e8 in each case
}
然而,我的输出如下-
Reduction Naive = 16777216.000000
Reduction OpenMP = 16777216.000000
Reduction Intrinsics = 100000000.000000
可以看出,只有Intrinsic functions计算正确,其他都面临精度问题。我完全意识到由于四舍五入而使用 float 可能会遇到的精度问题,所以我的问题是,为什么内在函数得到正确的答案,即使它实际上也是浮点值算术。
我将其编译为 - g++ -mavx2 -march=native -O3 -fopenmp main.cpp
尝试使用版本 7.5.0
以及 10.3.0
TIA
朴素循环按 1.0
添加,并在 16777216.000000
处停止添加,因为 binary32 浮点数中没有足够的有效数字。
查看此问答:Why does a float variable stop incrementing at 16777216 in C#?
当你添加计算的水平总和时,它会添加 8.0
,所以从什么时候开始停止添加的数字大约是 16777216*8
= 134217728
,你只是不在您的实验中达到它。