是否有 acos() 函数的准确近似值?

Is there an accurate approximation of the acos() function?

我需要在计算着色器中使用双精度的 acos() 函数。由于GLSL中没有双精度的acos()内置函数,我尝试自己实现

起初,我实现了一个泰勒级数,例如 Wiki - Taylor series 中的方程式,其中包含预先计算的教员值。但这似乎在 1 左右不准确。最大误差约为 0.08,迭代 40 次。

我还实现了 this method,它在 CPU 上运行良好,最大误差为 -2.22045e-16,但我在着色器中实现它时遇到了一些麻烦。

目前,我正在使用来自 here where somebody posted his approximation functions on this 站点的 acos() 近似函数。我正在使用本网站最准确的功能,现在我得到的最大误差为 -7.60454e-08,但该误差也太高了。

我这个函数的代码是:

double myACOS(double x)
{
    double part[4];
    part[0] = 32768.0/2835.0*sqrt(2.0-sqrt(2.0+sqrt(2.0+sqrt(2.0+2.0*x))));
    part[1] = 256.0/135.0*sqrt(2.0-sqrt(2.0+sqrt(2.0+2.0*x)));
    part[2] = 8.0/135.0*sqrt(2.0-sqrt(2.0+2.0*x));
    part[3] = 1.0/2835.0*sqrt(2.0-2.0*x);
    return (part[0]-part[1]+part[2]-part[3]);
}

有没有人知道 acos() 的另一种实现方法,它非常准确并且 - 如果可能的话 - 很容易在着色器中实现?

一些系统信息:

我当前 'acos()' 的准确着色器实现是通常的泰勒级数和 Bence 的答案的混合。通过 40 次迭代,我从 math.h 获得 'acos()' 实现的准确度为 4.44089e-16。也许它不是最好的,但对我有用:

这里是:

double myASIN2(double x)
{
    double sum, tempExp;
    tempExp = x;
    double factor = 1.0;
    double divisor = 1.0;
    sum = x;
    for(int i = 0; i < 40; i++)
    {
        tempExp *= x*x;
        divisor += 2.0;
        factor *= (2.0*double(i) + 1.0)/((double(i)+1.0)*2.0);
        sum += factor*tempExp/divisor;
    }
    return sum;
}

double myASIN(double x)
{
    if(abs(x) <= 0.71)
        return myASIN2(x);
    else if( x > 0)
        return (PI/2.0-myASIN2(sqrt(1.0-(x*x))));
    else //x < 0 or x is NaN
        return (myASIN2(sqrt(1.0-(x*x)))-PI/2.0);

}

double myACOS(double x)
{
    return (PI/2.0 - myASIN(x));
}

任何意见,可以做得更好吗?例如,对因子的值使用 LUT,但在我的着色器中 'acos()' 仅被调用一次,因此不需要它。

NVIDIA GT 555M GPU 是具有 2.1 计算能力的设备,因此本机硬件支持基本的双精度运算,包括 fused multipy-add (FMA). As with all NVIDIA GPUs, the square root operation is emulated. I am familiar with CUDA, but not GLSL. According to version 4.3 of the GLSL specification, it exposes double-precision FMA as a function fma() and provides a double-precision square root, sqrt(). It is not clear whether the sqrt() implementation is correctly rounded according to IEEE-754 规则。我会假设它是,与 CUDA 类比。

不使用泰勒级数,而是希望使用多项式 minimax approximation, thereby reducing the number of terms required. Minimax approximations are typically generated using a variant of the Remez algorithm. To optimize both speed and accuracy, the use of FMA is essential. Evaluation of the polynomial with the Horner scheme is condusive to high accruacy. In the code below, a second order Horner scheme is utilized. As in DanceIgel's acos 可以方便地使用 asin 近似作为基本构建块与标准一起计算数学恒等式。

对于 400M 测试向量,使用以下代码看到的最大相对误差为 2.67e-16,而观察到的最大 ulp 误差为 1.442 ulp。

/* compute arcsin (a) for a in [-9/16, 9/16] */
double asin_core (double a)
{
    double q, r, s, t;

    s = a * a;
    q = s * s;
    r =             5.5579749017470502e-2;
    t =            -6.2027913464120114e-2;
    r = fma (r, q,  5.4224464349245036e-2);
    t = fma (t, q, -1.1326992890324464e-2);
    r = fma (r, q,  1.5268872539397656e-2);
    t = fma (t, q,  1.0493798473372081e-2);
    r = fma (r, q,  1.4106045900607047e-2);
    t = fma (t, q,  1.7339776384962050e-2);
    r = fma (r, q,  2.2372961589651054e-2);
    t = fma (t, q,  3.0381912707941005e-2);
    r = fma (r, q,  4.4642857881094775e-2);
    t = fma (t, q,  7.4999999991367292e-2);
    r = fma (r, s, t);
    r = fma (r, s,  1.6666666666670193e-1);
    t = a * s;
    r = fma (r, t, a);

    return r;
}

/* Compute arccosine (a), maximum error observed: 1.4316 ulp
   Double-precision factorization of π courtesy of Tor Myklebust
*/
double my_acos (double a)
{
    double r;

    r = (a > 0.0) ? -a : a; // avoid modifying the "sign" of NaNs
    if (r > -0.5625) {
        /* arccos(x) = pi/2 - arcsin(x) */
        r = fma (9.3282184640716537e-1, 1.6839188885261840e+0, asin_core (r));
    } else {
        /* arccos(x) = 2 * arcsin (sqrt ((1-x) / 2)) */
        r = 2.0 * asin_core (sqrt (fma (0.5, r, 0.5)));
    }
    if (!(a > 0.0) && (a >= -1.0)) { // avoid modifying the "sign" of NaNs
        /* arccos (-x) = pi - arccos(x) */
        r = fma (1.8656436928143307e+0, 1.6839188885261840e+0, -r);
    }
    return r;
}