得出平方根近似值的误差范围

Deriving error bounds for approximation of square root

this answer 关于四元数归一化的问题中,作者提供了一些计算平方根倒数的代码,使用 2.0 / (1.0 + qmagsq) 作为 1.0 / std::sqrt(qmagsq) 的近似值非常接近1:

double qmagsq = quat.square_magnitude();
if (std::abs(1.0 - qmagsq) < 2.107342e-08) {
    quat.scale (2.0 / (1.0 + qmagsq));
} else {
    quat.scale (1.0 / std::sqrt(qmagsq));
}

作者接着给出如下解释:

For values of qmagsq between 0 and 2, the error in this approximation is less than (1-qmagsq)^2 / 8. The magic number 2.107342e-08 represents where this error is more that half an ULP for IEEE doubles.

据推测,这是因为 sqrt(8 * 2^-(1+52) / 2) 大约为 2.10734243e-8,其中 2^-(1+52) / 2double.

精度的一半

对于介于 0 和 2 之间的 qmagsq 值,如何得出 (1-qmagsq)^2 / 8 作为此近似值误差的上限?

编辑:

有人指出,作者 does not actually holdqmagsq 的值在 0 和 1 之间提供的误差范围。因此,问题变得更加开放:

如何得出该近似值的误差界限,该误差界限可用于确定该近似值的误差范围小于 IEEE 双精度 ULP 的一半?

计算机已经变得足够快,可以在整个输入域上详尽地测试单参数函数的特定断言以获得单精度,并在相当小的间隔内测试双精度。实际界限通常约为 248 个测试向量左右。我假设使用 IEEE-754 兼容平台,默认舍入模式为最接近或偶数舍入,并且所有代码都严格遵守编译器可以召集的 IEEE-754(对于我的英特尔编译器,例如 /fp:strict

问题中的声明是快速替换在统一附近实现了 0.5 ulp 或更小的误差。换句话说,使用 IEEE-754 舍入模式将结果正确舍入为最接近或偶数。有两种方法可以测试该断言:使用 rsqrt() 的正确舍入实现作为参考并以 1 ulp 步骤迭代参数直到发现不匹配,或者使用多精度库作为参考,并在快速替代方案的 ulp 误差超过 0.5 ulp 时停止。在后一种情况下,我们需要的参考结果的精度是双精度精度的两倍多一点,以避免双舍入效应。对于平方根的倒数,一个2n+3位的引用就够了:

Cristina Iordache 和 David W. Matula:“关于除法、平方根、倒数和平方根倒数的无限精确舍入”。在 第 14 届 IEEE 计算机算术研讨会论文集,澳大利亚阿德莱德,1999 年 4 月 14 日至 16 日,第 233–240 页

下面的 ISO-C99 代码使用第一种方法。它从统一开始搜索,然后在零方向或二方向搜索,在第一个不匹配处停止。输出结果如下:

arg = (1.0 + 2.2204460492503131e-016)  quick_rsqrt =  0x1.0000000000000p+0 (1.0000000000000000e+000)  rsqrt_rn =  0x1.fffffffffffffp-1 (9.9999999999999989e-001)   
arg = (1.0 - 1.2166747276332046e-008)  quick_rsqrt =  0x1.0000001a20bd7p+0 (1.0000000060833736e+000)  rsqrt_rn =  0x1.0000001a20bd8p+0 (1.0000000060833738e+000)

我也尝试了第二种方法,得到了匹配的结果。快速替换在(1.0 + 2.2204460492503131e-16)处的误差为0.9995 ulps,在(1.0 - 1.2166747276332046e-8)处的误差为0.5002 ulps。

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* function under test */
double quick_rsqrt (double a)
{
    return 2.0 / (1.0 + a);
}

/* starting approximation for reciprocal square root */
double simple_rsqrt (double a)
{
    return 1.0 / sqrt (a);
}

/* most significant 32 bits of bit representation of IEEE-754 binary64 */
uint32_t hi_uint32_of_double (double a)
{
    uint64_t t;
    memcpy (&t, &a, sizeof t);
    return (uint32_t)(t >> 32);
}

/* least significant 32 bits of bit representation of IEEE-754 binary64 */
uint32_t lo_uint32_of_double (double a)
{
    uint64_t t;
    memcpy (&t, &a, sizeof t);
    return (uint32_t)t;
}

/* construct IEEE-754 binary64 from upper and lower half of its bit representation */
double mk_double_from_hilo_uint32 (uint32_t hi, uint32_t lo)
{
    double r;
    uint64_t t = ((uint64_t)hi << 32)  + ((uint64_t)lo);
    memcpy (&r, &t, sizeof r);
    return r;
}

/* reciprocal square root, rounded to-nearest-or-even */
double rsqrt_rn (double a)
{
    double y, h, l, e;
    uint32_t alo, ahi, temp;
    int32_t diff;
    
    ahi = hi_uint32_of_double (a);
    alo = lo_uint32_of_double (a);
    if ((ahi - 0x00100000u) < 0x7fe00000u) { // positive normals
        /* scale argument towards unity */
        temp = (ahi & 0x3fffffff) | 0x3fe00000;
        diff = temp - ahi; // exponent difference
        a = mk_double_from_hilo_uint32 (temp, alo); 
        /* initial rsqrt approximation */
        y = simple_rsqrt (a);
        /* refine reciprocal square root approximation */
        h = y * y;
        l = fma (y, y, -h);
        e = fma (l, -a, fma (h, -a, 1.0));
        /* round according to Peter Markstein, "IA-64 and Elementary Functions" */
        y = fma (fma (0.375, e, 0.5), e * y, y);
        /* scale result near unity to correct range */
        diff = diff >> 1; // adjust exponent; ensure arithmetic right shift which is not guaranteed by ISO-C99
        a = mk_double_from_hilo_uint32 (hi_uint32_of_double (y) + diff, lo_uint32_of_double (y));
    } else if (a == 0.0) { // zeros
        a = mk_double_from_hilo_uint32 ((ahi & 0x80000000) | 0x7ff00000, 0x00000000);
    } else if (a < 0.0) { // negatives
        a = mk_double_from_hilo_uint32 (0xfff80000, 0x00000000);
    } else if (isinf (a)) { // infinities
        a = mk_double_from_hilo_uint32 (ahi & 0x80000000, 0x00000000);
    } else if (isnan (a)) { // NaNs
        a = a + a;
    } else { // positive subnormals
        /* scale argument towards unity */
        a = a * mk_double_from_hilo_uint32 (0x7fd00000, 0);
        /* initial rsqrt approximation */
        y = simple_rsqrt (a);
        /* refine reciprocal square root approximation */
        h = y * y;
        l = fma (y, y, -h);
        e = fma (l, -a, fma (h, -a, 1.0));
        /* round according to Peter Markstein, "IA-64 and Elementary Functions" */
        y = fma (fma (0.375, e, 0.5), e * y, y);
        /* scale result near unity to correct range */
        a = mk_double_from_hilo_uint32 (hi_uint32_of_double (y) + 0x1ff00000, lo_uint32_of_double (y));
    }
    return a;
}

int main (void)
{
    double x, ref, res;

    /* Try arguments greater than unity */
    x = 1.0;
    do {
        res = quick_rsqrt (x);
        ref = rsqrt_rn (x);
        if (res != ref) {
            printf ("arg = (1.0 + %23.16e)  quick_rsqrt = %21.13a (%23.16e)  rsqrt_rn = %21.13a (%23.16e)\n", 
                    x - 1.0, res, res, ref, ref);
            break;
        }
        x = nextafter (x, 2.0);
    } while (x < 2.0);

    /* Try arguments less than unity */
    x = 1.0;
    do {
        res = quick_rsqrt (x);
        ref = rsqrt_rn (x);
        if (res != ref) {
            printf ("arg = (1.0 - %23.16e)  quick_rsqrt = %21.13a (%23.16e)  rsqrt_rn = %21.13a (%23.16e)\n", 
                    1.0 - x, res, res, ref, ref);
            break;
        }
        x = nextafter (x, 0.0);
    } while (x > 0.0);

    return EXIT_SUCCESS;
}