Single precision argument reduction for trigonometric functions in C
I have implemented some approximations of trigonometric functions (sin, cos, arctan) in C, computed with single precision (32-bit floating point). They are accurate to about +/- 2 ulp.
My target device does not support any <cmath>
or <math.h>
functions. It does not provide an FMA, but it does provide a MAC ALU. The ALU and LU operate on the 32-bit format.
My arctan approximation is actually a modified version of an approximation by N. Juffa, which approximates arctan over the full range. The sine and cosine functions are accurate to 2 ulp within the range [-pi, pi].
My goal now is to support a larger input range for sine and cosine (as large as possible, ideally [FLT_MIN, FLT_MAX]), which brings me to argument reduction.
I am currently reading various papers, such as ARGUMENT REDUCTION FOR HUGE ARGUMENTS by
K. C. Ng, or the paper about a new argument reduction algorithm. The latter looks promising, but I have not been able to derive an implementation from it.
I would also like to mention two related Stack Overflow questions: there is an approach with matlab and c++ that is based on the first paper I linked. It actually uses matlab, cmath functions, and limits the input to [0, 20.000]. The other one was already mentioned in the comments; it is an approach to implementing sin and cos in C, using various C libraries that are unavailable to me. Since both posts are several years old, there may have been new findings since then.
The algorithm most commonly used in this situation seems to be to store the digits of 2/pi to the required precision, so that the modulo can be computed accurately while avoiding cancellation. My device does not offer a large DMEM, which means large lookup tables with hundreds of bits are not feasible. This procedure is actually described on page 70 of this reference, which, by the way, provides a lot of useful information on floating-point math.
So my question is: is there another efficient way to reduce the arguments for sine and cosine to single precision while avoiding large LUTs? The papers mentioned above actually focus on double precision and use up to 1000 digits, which does not suit my use case.
I actually have not found any implementation in C, nor any aimed at single-precision computation. I would appreciate any kind of hints/links/examples...
The following code is based on a previous answer of mine, in which I demonstrated how to perform a fairly accurate argument reduction for trigonometric functions by using the Cody-Waite method of split constants for arguments small in magnitude, and the Payne-Hanek method for arguments large in magnitude. For details on the Payne-Hanek algorithm see there, for details on the Cody-Waite algorithm see this.
Here I made the adjustments necessary to accommodate the restrictions of the asker's platform: no 64-bit types are available, fused multiply-add is not supported, and helper functions from math.h
are unavailable. I assume that float
maps to the IEEE-754 binary32
format, and that there is a way to re-interpret such a 32-bit float as a 32-bit unsigned integer and vice versa. I have implemented this re-interpretation via the standard portable idiom, that is, by using memcpy()
, but other methods may be chosen that are appropriate for the unspecified target platform, such as inline assembly, machine-specific intrinsics, or volatile unions.
Since this code is basically a port of my earlier code to a more restrictive environment, it perhaps lacks the elegance of a de novo design aimed specifically at that environment. I basically replaced the frexp()
helper function from math.h
with some bit twiddling, emulated 64-bit integer computation with pairs of 32-bit integers, replaced the double-precision computation with 32-bit fixed-point computation (which worked much better than I expected), and replaced every FMA with its unfused equivalent.
Re-working the Cody-Waite portion of the argument reduction took quite a bit of work. Clearly, without FMA available, we need to ensure a sufficient number of trailing zero bits in the constituent parts of the constant π/2 (except the least significant one) to make sure the products are exact. I spent several hours experimentally puzzling out a particular split that delivers accurate results but also pushes the switchover point to the Payne-Hanek method as high as possible.
With USE_FMA = 1
specified, the output of the test app, when built against a high-quality math library, should look similar to this:
Testing sinf ... PASSED. max ulp err = 1.493253 diffsum = 337633490
Testing cosf ... PASSED. max ulp err = 1.495098 diffsum = 342020968
With USE_FMA = 0
, the accuracy is slightly worse:
Testing sinf ... PASSED. max ulp err = 1.498012 diffsum = 359702532
Testing cosf ... PASSED. max ulp err = 1.504061 diffsum = 364682650
The diffsum
output is a rough indicator of overall accuracy; here it shows that about 90% of all inputs deliver a correctly rounded single-precision response.
Note that it is important to compile the code with the strictest floating-point settings and the highest degree of adherence to IEEE-754 that the compiler offers. For the Intel compiler, which I used to develop and test this code, this can be achieved by compiling with /fp:strict
. Also, the quality of the math library used as a reference is crucial for an accurate assessment of the ulp error of this single-precision code. The Intel compiler ships with a math library that provides double-precision elementary math functions with an error slightly larger than 0.5 ulp in the HA (high accuracy) variant. Use of a multi-precision reference library may be preferable, but would have slowed me down too much here.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h> // for memcpy()
#include <math.h> // for test purposes, and when PORTABLE=1 or USE_FMA=1
#define USE_FMA (0) // use fmaf() calls for arithmetic
#define PORTABLE (0) // allow helper functions from math.h
#define HAVE_U64 (0) // 64-bit integer type available
#define CW_STAGES (3) // number of stages in Cody-Waite reduction when USE_FMA=0
#if USE_FMA
#define SIN_RED_SWITCHOVER (117435.992f)
#define COS_RED_SWITCHOVER (71476.0625f)
#define MAX_DIFF (1)
#else // USE_FMA
#if CW_STAGES == 2
#define SIN_RED_SWITCHOVER (3.921875f)
#define COS_RED_SWITCHOVER (3.921875f)
#elif CW_STAGES == 3
#define SIN_RED_SWITCHOVER (201.15625f)
#define COS_RED_SWITCHOVER (142.90625f)
#endif // CW_STAGES
#define MAX_DIFF (2)
#endif // USE_FMA
/* re-interpret the bit pattern of an IEEE-754 float as a uint32 */
uint32_t float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
/* re-interpret the bit pattern of a uint32 as an IEEE-754 float */
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
/* Compute the upper 32 bits of the product of two unsigned 32-bit integers */
#if HAVE_U64
uint32_t umul32_hi (uint32_t a, uint32_t b)
{
return (uint32_t)(((uint64_t)a * b) >> 32);
}
#else // HAVE_U64
/* Henry S. Warren, "Hacker's Delight, 2nd ed.", Addison-Wesley 2012. Fig. 8-2 */
uint32_t umul32_hi (uint32_t a, uint32_t b)
{
uint16_t a_lo = (uint16_t)a;
uint16_t a_hi = a >> 16;
uint16_t b_lo = (uint16_t)b;
uint16_t b_hi = b >> 16;
uint32_t p0 = (uint32_t)a_lo * b_lo;
uint32_t p1 = (uint32_t)a_lo * b_hi;
uint32_t p2 = (uint32_t)a_hi * b_lo;
uint32_t p3 = (uint32_t)a_hi * b_hi;
uint32_t t = (p0 >> 16) + p1;
return (t >> 16) + (((uint32_t)(uint16_t)t + p2) >> 16) + p3;
}
#endif // HAVE_U64
/* 190 bits of 2/PI for Payne-Hanek style argument reduction. */
const uint32_t two_over_pi_f [] =
{
0x28be60db,
0x9391054a,
0x7f09d5f4,
0x7d4d3770,
0x36d8a566,
0x4f10e410
};
/* Reduce a trig function argument using the slow Payne-Hanek method */
float trig_red_slowpath_f (float a, int *quadrant)
{
uint32_t ia, hi, mid, lo, tmp, i, l, h, plo, phi;
int32_t e, q;
float r;
#if PORTABLE
ia = (uint32_t)(fabsf (frexpf (a, &e)) * 4.29496730e+9f); // 0x1.0p32
#else // PORTABLE
ia = ((float_as_uint32 (a) & 0x007fffff) << 8) | 0x80000000;
e = ((float_as_uint32 (a) >> 23) & 0xff) - 126;
#endif // PORTABLE
/* compute product x * 2/pi in 2.62 fixed-point format */
i = (uint32_t)e >> 5;
e = (uint32_t)e & 31;
hi = i ? two_over_pi_f [i-1] : 0;
mid = two_over_pi_f [i+0];
lo = two_over_pi_f [i+1];
tmp = two_over_pi_f [i+2];
if (e) {
hi = (hi << e) | (mid >> (32 - e));
mid = (mid << e) | (lo >> (32 - e));
lo = (lo << e) | (tmp >> (32 - e));
}
/* compute 64-bit product phi:plo */
phi = 0;
l = ia * lo;
h = umul32_hi (ia, lo);
plo = phi + l;
phi = h + (plo < l);
l = ia * mid;
h = umul32_hi (ia, mid);
plo = phi + l;
phi = h + (plo < l);
l = ia * hi;
phi = phi + l;
/* split fixed-point result into integer and fraction portions */
q = phi >> 30; // integral portion = quadrant<1:0>
phi = phi & 0x3fffffff; // fraction
if (phi & 0x20000000) { // fraction >= 0.5
phi = phi - 0x40000000; // fraction - 1.0
q = q + 1;
}
/* compute remainder of x / (pi/2) */
#if USE_FMA
float phif, plof, chif, clof, thif, tlof;
phif = 1.34217728e+8f * (float)(int32_t)(phi & 0xffffffe0); // 0x1.0p27
plof = (float)((plo >> 5) | (phi << (32-5)));
thif = phif + plof;
plof = (phif - thif) + plof;
phif = thif;
chif = 1.08995894e-17f; // 0x1.921fb6p-57 // (1.5707963267948966 * 0x1.0p-57)_hi
clof = -3.03308686e-25f; // -0x1.777a5cp-82 // (1.5707963267948966 * 0x1.0p-57)_lo
thif = phif * chif;
tlof = fmaf (phif, chif, -thif);
tlof = fmaf (phif, clof, tlof);
tlof = fmaf (plof, chif, tlof);
r = thif + tlof;
#else // USE_FMA
/* record sign of fraction */
uint32_t s = phi & 0x80000000;
/* take absolute value of fraction */
if ((int32_t)phi < 0) {
phi = ~phi;
plo = 0 - plo;
phi += (plo == 0);
}
/* normalize fraction */
e = 0;
while ((int32_t)phi > 0) {
phi = (phi << 1) | (plo >> 31);
plo = plo << 1;
e--;
}
/* multiply 32 high-order bits of fraction with pi/2 */
phi = umul32_hi (phi, 0xc90fdaa2); // (uint32_t)rint(PI/2 * 2**31)
/* normalize product */
if ((int32_t)phi > 0) {
phi = phi << 1;
e--;
}
/* round and convert to floating point */
uint32_t ri = s + ((e + 128) << 23) + (phi >> 8) + ((phi & 0xff) > 0x7e);
r = uint32_as_float (ri);
#endif // USE_FMA
if (a < 0.0f) {
r = -r;
q = -q;
}
*quadrant = q;
return r;
}
/* Argument reduction for trigonometric functions that reduces the argument
to the interval [-PI/4, +PI/4] and also returns the quadrant. It returns
-0.0f for an input of -0.0f
*/
float trig_red_f (float a, float switch_over, int *q)
{
float j, r;
if (fabsf (a) > switch_over) {
/* Payne-Hanek style reduction. M. Payne and R. Hanek, "Radian reduction
for trigonometric functions". SIGNUM Newsletter, 18:19-24, 1983
*/
r = trig_red_slowpath_f (a, q);
} else {
/* Cody-Waite style reduction. W. J. Cody and W. Waite, "Software Manual
for the Elementary Functions", Prentice-Hall 1980
*/
#if USE_FMA
j = fmaf (a, 6.36619747e-1f, 1.2582912e+7f); // 0x1.45f306p-1, 0x1.8p+23
j = j - 1.25829120e+7f; // 0x1.8p+23
r = fmaf (j, -1.57079601e+00f, a); // -0x1.921fb0p+00 // pio2_high
r = fmaf (j, -3.13916473e-07f, r); // -0x1.5110b4p-22 // pio2_mid
r = fmaf (j, -5.39030253e-15f, r); // -0x1.846988p-48 // pio2_low
#else // USE_FMA
j = (a * 6.36619747e-1f + 1.2582912e+7f); // 0x1.45f306p-1, 0x1.8p+23
j = j - 1.25829120e+7f; // 0x1.8p+23
#if CW_STAGES == 2
r = a - j * 1.57079625e+00f; // 0x1.921fb4p+0 // pio2_high
r = r - j * 7.54979013e-08f; // 0x1.4442d2p-24 // pio2_low
#elif CW_STAGES == 3
r = a - j * 1.57078552e+00f; // 0x1.921f00p+00 // pio2_high
r = r - j * 1.08043314e-05f; // 0x1.6a8880p-17 // pio2_mid
r = r - j * 2.56334407e-12f; // 0x1.68c234p-39 // pio2_low
#endif // CW_STAGES
#endif // USE_FMA
*q = (int)j;
}
return r;
}
/* Approximate sine on [-PI/4,+PI/4]. Maximum ulp error with USE_FMA = 1: 0.64196
Returns -0.0f for an argument of -0.0f
Polynomial approximation based on T. Myklebust, "Computing accurate
Horner form approximations to special functions in finite precision
arithmetic", http://arxiv.org/abs/1508.03211, retrieved on 8/29/2016
*/
float sinf_poly (float a, float s)
{
float r, t;
#if USE_FMA
r = 2.86567956e-6f; // 0x1.80a000p-19
r = fmaf (r, s, -1.98559923e-4f); // -0x1.a0690cp-13
r = fmaf (r, s, 8.33338592e-3f); // 0x1.111182p-07
r = fmaf (r, s, -1.66666672e-1f); // -0x1.555556p-03
t = fmaf (a, s, 0.0f); // ensure -0 is passed through
r = fmaf (r, t, a);
#else // USE_FMA
r = 2.86567956e-6f; // 0x1.80a000p-19
r = r * s - 1.98559923e-4f; // -0x1.a0690cp-13
r = r * s + 8.33338592e-3f; // 0x1.111182p-07
r = r * s - 1.66666672e-1f; // -0x1.555556p-03
t = a * s + 0.0f; // ensure -0 is passed through
r = r * t + a;
#endif // USE_FMA
return r;
}
/* Approximate cosine on [-PI/4,+PI/4]. Maximum ulp error with USE_FMA = 1: 0.87444 */
float cosf_poly (float s)
{
float r;
#if USE_FMA
r = 2.44677067e-5f; // 0x1.9a8000p-16
r = fmaf (r, s, -1.38877297e-3f); // -0x1.6c0efap-10
r = fmaf (r, s, 4.16666567e-2f); // 0x1.555550p-05
r = fmaf (r, s, -5.00000000e-1f); // -0x1.000000p-01
r = fmaf (r, s, 1.00000000e+0f); // 0x1.000000p+00
#else // USE_FMA
r = 2.44677067e-5f; // 0x1.9a8000p-16
r = r * s - 1.38877297e-3f; // -0x1.6c0efap-10
r = r * s + 4.16666567e-2f; // 0x1.555550p-05
r = r * s - 5.00000000e-1f; // -0x1.000000p-01
r = r * s + 1.00000000e+0f; // 0x1.000000p+00
#endif // USE_FMA
return r;
}
/* Map sine or cosine value based on quadrant */
float sinf_cosf_core (float a, int i)
{
float r, s;
s = a * a;
r = (i & 1) ? cosf_poly (s) : sinf_poly (a, s);
if (i & 2) {
r = 0.0f - r; // don't change "sign" of NaNs
}
return r;
}
/* maximum ulp error with USE_FMA = 1: 1.495098 */
float my_sinf (float a)
{
float r;
int i;
a = a * 0.0f + a; // inf -> NaN
r = trig_red_f (a, SIN_RED_SWITCHOVER, &i);
r = sinf_cosf_core (r, i);
return r;
}
/* maximum ulp error with USE_FMA = 1: 1.493253 */
float my_cosf (float a)
{
float r;
int i;
a = a * 0.0f + a; // inf -> NaN
r = trig_red_f (a, COS_RED_SWITCHOVER, &i);
r = sinf_cosf_core (r, i + 1);
return r;
}
/* re-interpret bit pattern of an IEEE-754 double as a uint64 */
uint64_t double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
double floatUlpErr (float res, double ref)
{
uint64_t i, j, err, refi;
int expoRef;
/* ulp error cannot be computed if either operand is NaN, infinity, zero */
if (isnan (res) || isnan (ref) || isinf (res) || isinf (ref) ||
(res == 0.0f) || (ref == 0.0f)) {
return 0.0;
}
/* Convert the float result to an "extended float". This is like a float
with 56 instead of 24 effective mantissa bits.
*/
i = ((uint64_t)float_as_uint32(res)) << 32;
/* Convert the double reference to an "extended float". If the reference is
>= 2^129, we need to clamp to the maximum "extended float". If reference
is < 2^-126, we need to denormalize because of the float types's limited
exponent range.
*/
refi = double_as_uint64(ref);
expoRef = (int)(((refi >> 52) & 0x7ff) - 1023);
if (expoRef >= 129) {
j = 0x7fffffffffffffffULL;
} else if (expoRef < -126) {
j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
j = j >> (-(expoRef + 126));
} else {
j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
j = j | ((uint64_t)(expoRef + 127) << 55);
}
j = j | (refi & 0x8000000000000000ULL);
err = (i < j) ? (j - i) : (i - j);
return err / 4294967296.0;
}
int main (void)
{
float arg, res, reff;
uint32_t argi, resi, refi;
int64_t diff, diffsum;
double ref, ulp, maxulp;
printf ("Testing sinf ... ");
diffsum = 0;
maxulp = 0;
argi = 0;
do {
arg = uint32_as_float (argi);
res = my_sinf (arg);
ref = sin ((double)arg);
reff = (float)ref;
resi = float_as_uint32 (res);
refi = float_as_uint32 (reff);
ulp = floatUlpErr (res, ref);
if (ulp > maxulp) {
maxulp = ulp;
}
diff = (resi > refi) ? (resi - refi) : (refi - resi);
if (diff > MAX_DIFF) {
printf ("\nerror @ %08x (% 15.8e): res=%08x (% 15.8e) ref=%08x (%15.8e)\n", argi, arg, resi, res, refi, reff);
return EXIT_FAILURE;
}
diffsum = diffsum + diff;
argi++;
} while (argi);
printf ("PASSED. max ulp err = %.6f diffsum = %lld\n", maxulp, diffsum);
printf ("Testing cosf ... ");
diffsum = 0;
maxulp = 0;
argi = 0;
do {
arg = uint32_as_float (argi);
res = my_cosf (arg);
ref = cos ((double)arg);
reff = (float)ref;
resi = float_as_uint32 (res);
refi = float_as_uint32 (reff);
ulp = floatUlpErr (res, ref);
if (ulp > maxulp) {
maxulp = ulp;
}
diff = (resi > refi) ? (resi - refi) : (refi - resi);
if (diff > MAX_DIFF) {
printf ("\nerror @ %08x (% 15.8e): res=%08x (% 15.8e) ref=%08x (%15.8e)\n", argi, arg, resi, res, refi, reff);
return EXIT_FAILURE;
}
diffsum = diffsum + diff;
argi++;
} while (argi);
printf ("PASSED. max ulp err = %.6f diffsum = %lld\n", maxulp, diffsum);
return EXIT_SUCCESS;
}
There is a thread on the Mathematics forum where user J. M. ain't a mathematician introduced an improved Taylor/Padé idea to approximate the cos and sin functions in the range [-pi, pi]. Here's the sine version translated to C++. This approximation is not as fast as the library std::sin() function, but it may be worth checking whether an SSE/AVX/FMA implementation helps enough with speed.
I have not tested the ULP error against the library sin() or cos() functions, but judging by the Julia Function Accuracy Test tool it looks like a good approximation method (add the code below to the runtest.jl module that belongs to the Julia test suite):
function test_sine(x::AbstractFloat)
f=0.5
z=x*0.5
k=0
while (abs(z)>f)
z*=0.5
k=k+1
end
z2=z^2;
r=z*(1+(z2/105-1)*((z/3)^2))/
(1+(z2/7-4)*((z/3)^2));
while(k > 0)
r = (2*r)/(1-r*r);
k=k-1
end
return (2*r)/(1+r*r)
end
function test_cosine(x::AbstractFloat)
f=0.5
z=x*0.5
k=0
while (abs(z)>f)
z*=0.5
k=k+1
end
z2=z^2;
r=z*(1+(z2/105-1)*((z/3)^2))/
(1+(z2/7-4)*((z/3)^2));
while (k > 0)
r = (2*r)/(1-r*r);
k=k-1
end
return (1-r*r)/(1+r*r)
end
pii = 3.141592653589793238462643383279502884
MAX_SIN(n::Val{pii}, ::Type{Float16}) = 3.1415926535897932f0
MAX_SIN(n::Val{pii}, ::Type{Float32}) = 3.1415926535897932f0
#MAX_SIN(n::Val{pii}, ::Type{Float64}) = 3.141592653589793238462643383279502884
MIN_SIN(n::Val{pii}, ::Type{Float16}) = -3.1415926535897932f0
MIN_SIN(n::Val{pii}, ::Type{Float32}) = -3.1415926535897932f0
#MIN_SIN(n::Val{pii}, ::Type{Float64}) = -3.141592653589793238462643383279502884
for (func, base) in (sin=>Val(pii), test_sine=>Val(pii), cos=>Val(pii), test_cosine=>Val(pii))
for T in (Float16, Float32)
xx = range(MIN_SIN(base,T), MAX_SIN(base,T), length = 10^6);
test_acc(func, xx)
end
end
Results of the approximations, and of sin() and cos(), over the range [-pi, pi]:
Tol debug failed 0.0% of the time.
sin
ULP max 0.5008857846260071 at x = 2.203355
ULP mean 0.24990503381476237
Test Summary: | Pass Total
Float32 sin | 1 1
Tol debug failed 0.0% of the time.
sin
ULP max 0.5008857846260071 at x = 2.203355
ULP mean 0.24990503381476237
Test Summary: | Pass Total
Float32 sin | 1 1
Tol debug failed 0.0% of the time.
test_sine
ULP max 0.001272978144697845 at x = 2.899093
ULP mean 1.179825295005716e-8
Test Summary: | Pass Total
Float32 test_sine | 1 1
Tol debug failed 0.0% of the time.
test_sine
ULP max 0.001272978144697845 at x = 2.899093
ULP mean 1.179825295005716e-8
Test Summary: | Pass Total
Float32 test_sine | 1 1
Tol debug failed 0.0% of the time.
cos
ULP max 0.5008531212806702 at x = 0.45568538
ULP mean 0.2499933592458589
Test Summary: | Pass Total
Float32 cos | 1 1
Tol debug failed 0.0% of the time.
cos
ULP max 0.5008531212806702 at x = 0.45568538
ULP mean 0.2499933592458589
Test Summary: | Pass Total
Float32 cos | 1 1
Tol debug failed 0.0% of the time.
test_cosine
ULP max 0.0011584102176129818 at x = 1.4495481
ULP mean 1.6793535615395134e-8
Test Summary: | Pass Total
Float32 test_cosine | 1 1
Tol debug failed 0.0% of the time.
test_cosine
ULP max 0.0011584102176129818 at x = 1.4495481
ULP mean 1.6793535615395134e-8
Test Summary: | Pass Total
Float32 test_cosine | 1 1