Reference implementation of the vrecpeq_f32 intrinsic?
There is the ARM NEON intrinsic vrecpeq_f32.
The official description of vrecpeq_f32: https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&q=vrecpeq_f32 .
Floating-point Reciprocal Estimate. This instruction finds an approximate reciprocal estimate for each vector element in the source SIMD&FP register, places the result in a vector, and writes the vector to the destination SIMD&FP register.
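However, that description is still not precise enough for me. I am wondering whether we can write a reference implementation in C/C++ that produces exactly the same results as vrecpeq_f32.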
I tried calling vrecpeq_f32 and got these results:
float32x4_t v1 = {1, 2, 3, 4};
float32x4_t v_out = vrecpeq_f32(v1);//0.99805, 0.49902, 0.33301, 0.24951
I am curious why the reciprocal of 1 comes out as 0.99805 rather than 1.0.
P.S. I am not interested in tricks for getting a more accurate reciprocal out of the NEON intrinsics, such as one or more Newton-Raphson iterations.
The ARM documentation provides pseudocode that spells out the exact algorithm being performed. Look for FPRecipEstimate, which uses the fixed-point RecipEstimate.
It may look like a lot of code, but a large part of it handles the various edge cases, operating modes, and element sizes.
Just wondering if we can write a reference implementation in C/C++ that keep exactly the same result as vrecpeq_f32?
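Sure! In the end it all boils down to bit manipulation, so there is no reason it would not be feasible. Translated to C++, with most of the edge-case handling and the increased-precision mode stripped out, it looks like this (see godbolt):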
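Disclaimer: this is not a complete implementation of the function, just enough to explore the precision behavior, assuming finite, normalized inputs and no special cases. Do not drop it into a codebase and expect it to match the instruction in general.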
#include <cstdint>   // std::uint32_t
#include <cstring>   // std::memcpy
#include <iomanip>   // std::setprecision
#include <iostream>
// Convenience struct to deal with encoding and decoding ieee754 floats
struct float_parts {
    explicit float_parts(float v);
    explicit operator float() const;
    std::uint32_t sign;
    std::uint32_t fraction;
    std::uint32_t exp;
};
// Adapted from:
// https://developer.arm.com/documentation/ddi0596/2021-03/Shared-Pseudocode/Shared-Functions?lang=en#impl-shared.FPRecipEstimate.2
// RecipEstimate()
// ===============
// Compute estimate of reciprocal of 9-bit fixed-point number.
//
// a is in range 256 .. 511 representing a number in
// the range 0.5 <= x < 1.0.
// result is in the range 256 .. 511 representing a
// number in the range 1.0 to 511/256
std::uint32_t RecipEstimate(std::uint32_t a) {
    a = a * 2 + 1;
    std::uint32_t b = (1 << 19) / a;
    return (b + 1) / 2;
}
// FPRecipEstimate()
// =================
float FPRecipEstimate(float operand) {
    // ([...],sign,[...]) = FPUnpack(operand, [...], [...]);
    // fraction = operand<22:0> : Zeros(29);
    // exp = UInt(operand<30:23>);
    float_parts parts{operand};

    // scaled = UInt('1':fraction<51:44>);
    // Here the fraction is kept at 23 bits, so its top 8 bits are bits 22..15.
    std::uint32_t scaled = 0x100 | ((parts.fraction >> 15) & 0xFF);

    // when 32 result_exp = 253 - exp; // In range 253-254 = -1 to 253+1 = 254
    parts.exp = 253 - parts.exp;

    // // Scaled is in range 256 .. 511 representing a
    // // fixed-point number in range [0.5 .. 1.0].
    // estimate = RecipEstimate(scaled, increasedprecision);
    std::uint32_t estimate = RecipEstimate(scaled);

    // fraction = estimate<7:0> : Zeros(44);  (branch without increased precision)
    parts.fraction = (estimate & 0xff) << 15;
    return float(parts);
}
int main() {
    std::cout << std::setprecision(5)
              << FPRecipEstimate(1.0f) << "\n"
              << FPRecipEstimate(2.0f) << "\n"
              << FPRecipEstimate(3.0f) << "\n"
              << FPRecipEstimate(4.0f);
}
float_parts::float_parts(float v) {
    std::uint32_t v_bits;
    std::memcpy(&v_bits, &v, sizeof(float));
    sign = (v_bits >> 31) & 0x1;
    fraction = v_bits & ((1 << 23) - 1);
    exp = (v_bits >> 23) & 0xff;
}
float_parts::operator float() const {
    std::uint32_t v_bits =
        ((sign & 0x1) << 31) |
        (fraction & ((1 << 23) - 1)) |
        ((exp & 0xff) << 23);
    float result;
    std::memcpy(&result, &v_bits, sizeof(float));
    return result;
}
This produces the expected values:
0.99805
0.49902
0.33301
0.24951