
Fastest Implementation of Exponential Function Using AVX

I am looking for an efficient (fast) approximation of the exponential function operating on AVX elements (single-precision floating point). That is, __m256 _mm256_exp_ps( __m256 x ) without SVML.

Relative precision should be around ~1e-6, or ~20 mantissa bits (1 part in 2^20).

I would be happy if it is written in C style with Intel intrinsics.
The code should be portable (Windows, macOS, Linux, MSVC, ICC, GCC, etc.).


This is similar to an earlier question, but that one is looking for the fastest possible evaluation at low precision (the current answers there give about 1e-3 accuracy).

Also, this question is looking for AVX/AVX2 (and FMA). Note, however, that the answers to both questions are easily ported between SSE4 __m128 and AVX2 __m256, so future readers should choose based on the desired precision/performance trade-off.

You can approximate the exponential yourself with a Taylor series:

exp(z) = 1 + z + pow(z,2)/2 + pow(z,3)/6 + pow(z,4)/24 + ...

For this you only need addition and multiplication from AVX. Coefficients such as 1/2, 1/6, 1/24, etc. are faster if they are hard-coded and then multiplied by, rather than divided by.

Take as many members of the series as your precision requirement demands. Note that you get a relative error: for small z the error may be around 1e-6 in absolute terms, while for large z it will be more than 1e-6 in absolute terms, yet abs(E-E1)/abs(E) stays below 1e-6 (where E is the exact exponential and E1 is the approximation).

Update: As @Peter Cordes mentions in the comments, precision can be improved by separating the exponentiation of the integer and fractional parts, handling the integer part by manipulating the exponent field of the binary float representation (which is based on 2^x, not e^x). Then the Taylor series only needs to minimize the error over a small range.
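
As a minimal sketch of this idea (illustrative only, not code from the answer), a degree-4 Taylor polynomial evaluated in Horner form with the hard-coded reciprocal coefficients could look as follows; it is only useful for small |z| unless combined with the range reduction described in the update above:

#include <immintrin.h>

/* Sketch: degree-4 Taylor polynomial of exp(z) in Horner form, with the
   reciprocals 1/2, 1/6, 1/24 hard-coded as multiplications.
   Reasonably accurate only for small |z|; no range reduction is done. */
static inline __m256 exp_taylor4_ps(__m256 z)
{
    const __m256 c4  = _mm256_set1_ps(1.0f / 24.0f);
    const __m256 c3  = _mm256_set1_ps(1.0f / 6.0f);
    const __m256 c2  = _mm256_set1_ps(0.5f);
    const __m256 one = _mm256_set1_ps(1.0f);

    /* (((z/24 + 1/6)*z + 1/2)*z + 1)*z + 1  =  1 + z + z^2/2 + z^3/6 + z^4/24 */
    __m256 p = _mm256_add_ps(_mm256_mul_ps(c4, z), c3);
    p = _mm256_add_ps(_mm256_mul_ps(p, z), c2);
    p = _mm256_add_ps(_mm256_mul_ps(p, z), one);
    p = _mm256_add_ps(_mm256_mul_ps(p, z), one);
    return p;
}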

The exp functions in avx_mathfun use range reduction in combination with a Chebyshev-approximation-like polynomial to compute 8 exponentials in parallel with AVX instructions. Use the right compiler settings to make sure that addps and mulps are fused to FMA instructions where possible.

Adapting the original exp code from avx_mathfun to portable (across different compilers) C / AVX2 intrinsics code is straightforward. The original code uses gcc-style alignment attributes and ingenious macros. The modified code, which uses the standard _mm256_set1_ps() instead, is below the small test code and the table. The modified code requires AVX2.

The following code is used for a simple test:

int main(){
    int i;
    float xv[8];
    float yv[8];
    __m256 x = _mm256_setr_ps(1.0f, 2.0f, 3.0f ,4.0f ,5.0f, 6.0f, 7.0f, 8.0f);
    __m256 y = exp256_ps(x);
    _mm256_storeu_ps(xv,x);   /* unaligned store: xv and yv are not guaranteed to be 32-byte aligned */
    _mm256_storeu_ps(yv,y);

    for (i=0;i<8;i++){
        printf("i = %i, x = %e, y = %e \n",i,xv[i],yv[i]);
    }
    return 0;
}

The output looks OK:

i = 0, x = 1.000000e+00, y = 2.718282e+00 
i = 1, x = 2.000000e+00, y = 7.389056e+00 
i = 2, x = 3.000000e+00, y = 2.008554e+01 
i = 3, x = 4.000000e+00, y = 5.459815e+01 
i = 4, x = 5.000000e+00, y = 1.484132e+02 
i = 5, x = 6.000000e+00, y = 4.034288e+02 
i = 6, x = 7.000000e+00, y = 1.096633e+03 
i = 7, x = 8.000000e+00, y = 2.980958e+03 

The modified code (AVX2) is:

#include <stdio.h>
#include <immintrin.h>
/*     gcc -O3 -m64 -Wall -mavx2 -march=broadwell  expc.c    */

__m256 exp256_ps(__m256 x) {
/* Modified code. The original code is here: https://github.com/reyoung/avx_mathfun

   AVX implementation of exp
   Based on "sse_mathfun.h", by Julien Pommier
   http://gruntthepeon.free.fr/ssemath/
   Copyright (C) 2012 Giovanni Garberoglio
   Interdisciplinary Laboratory for Computational Science (LISC)
   Fondazione Bruno Kessler and University of Trento
   via Sommarive, 18
   I-38123 Trento (Italy)
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.
  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:
  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.
  (this is the zlib license)
*/
/* 
  To increase the compatibility across different compilers the original code is
  converted to plain AVX2 intrinsics code without ingenious macro's,
  gcc style alignment attributes etc. The modified code requires AVX2
*/
__m256   exp_hi        = _mm256_set1_ps(88.3762626647949f);
__m256   exp_lo        = _mm256_set1_ps(-88.3762626647949f);

__m256   cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341);
__m256   cephes_exp_C1 = _mm256_set1_ps(0.693359375);
__m256   cephes_exp_C2 = _mm256_set1_ps(-2.12194440e-4);

__m256   cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256   cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256   cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256   cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256   cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256   cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256   tmp           = _mm256_setzero_ps(), fx;
__m256i  imm0;
__m256   one           = _mm256_set1_ps(1.0f);

        x     = _mm256_min_ps(x, exp_hi);
        x     = _mm256_max_ps(x, exp_lo);

  /* express exp(x) as exp(g + n*log(2)) */
        fx    = _mm256_mul_ps(x, cephes_LOG2EF);
        fx    = _mm256_add_ps(fx, _mm256_set1_ps(0.5f));
        tmp   = _mm256_floor_ps(fx);
__m256  mask  = _mm256_cmp_ps(tmp, fx, _CMP_GT_OS);    
        mask  = _mm256_and_ps(mask, one);
        fx    = _mm256_sub_ps(tmp, mask);
        tmp   = _mm256_mul_ps(fx, cephes_exp_C1);
__m256  z     = _mm256_mul_ps(fx, cephes_exp_C2);
        x     = _mm256_sub_ps(x, tmp);
        x     = _mm256_sub_ps(x, z);
        z     = _mm256_mul_ps(x,x);

__m256  y     = cephes_exp_p0;
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p1);
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p2);
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p3);
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p4);
        y     = _mm256_mul_ps(y, x);
        y     = _mm256_add_ps(y, cephes_exp_p5);
        y     = _mm256_mul_ps(y, z);
        y     = _mm256_add_ps(y, x);
        y     = _mm256_add_ps(y, one);

  /* build 2^n */
        imm0  = _mm256_cvttps_epi32(fx);
        imm0  = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
        imm0  = _mm256_slli_epi32(imm0, 23);
__m256  pow2n = _mm256_castsi256_ps(imm0);
        y     = _mm256_mul_ps(y, pow2n);
        return y;
}

int main(){
    int i;
    float xv[8];
    float yv[8];
    __m256 x = _mm256_setr_ps(1.0f, 2.0f, 3.0f ,4.0f ,5.0f, 6.0f, 7.0f, 8.0f);
    __m256 y = exp256_ps(x);
    _mm256_storeu_ps(xv,x);   /* unaligned store: xv and yv are not guaranteed to be 32-byte aligned */
    _mm256_storeu_ps(yv,y);

    for (i=0;i<8;i++){
        printf("i = %i, x = %e, y = %e \n",i,xv[i],yv[i]);
    }
    return 0;
}


As pointed out in the comments, it should be possible to replace _mm256_floor_ps(fx + 0.5f) by _mm256_round_ps(fx). Moreover, the line mask = _mm256_cmp_ps(tmp, fx, _CMP_GT_OS); and the two lines that follow it appear to be redundant. Further optimization is possible by combining cephes_exp_C1 and cephes_exp_C2 into a single constant inv_LOG2EF. This leads to the following code, which has not been tested thoroughly!

#include <stdio.h>
#include <immintrin.h>
#include <math.h>
/*    gcc -O3 -m64 -Wall -mavx2 -march=broadwell  expc.c -lm     */

__m256 exp256_ps(__m256 x) {
/* Modified code from this source: https://github.com/reyoung/avx_mathfun

   AVX implementation of exp
   Based on "sse_mathfun.h", by Julien Pommier
   http://gruntthepeon.free.fr/ssemath/
   Copyright (C) 2012 Giovanni Garberoglio
   Interdisciplinary Laboratory for Computational Science (LISC)
   Fondazione Bruno Kessler and University of Trento
   via Sommarive, 18
   I-38123 Trento (Italy)
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.
  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:
  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.
  (this is the zlib license)

*/
/* 
  To increase the compatibility across different compilers the original code is
  converted to plain AVX2 intrinsics code without ingenious macro's,
  gcc style alignment attributes etc.
  Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
  This modified code is not thoroughly tested!
*/


__m256   exp_hi        = _mm256_set1_ps(88.3762626647949f);
__m256   exp_lo        = _mm256_set1_ps(-88.3762626647949f);

__m256   cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256   inv_LOG2EF    = _mm256_set1_ps(0.693147180559945f);

__m256   cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256   cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256   cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256   cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256   cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256   cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256   fx;
__m256i  imm0;
__m256   one           = _mm256_set1_ps(1.0f);

        x     = _mm256_min_ps(x, exp_hi);
        x     = _mm256_max_ps(x, exp_lo);

  /* express exp(x) as exp(g + n*log(2)) */
        fx     = _mm256_mul_ps(x, cephes_LOG2EF);
        fx     = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256  z      = _mm256_mul_ps(fx, inv_LOG2EF);
        x      = _mm256_sub_ps(x, z);
        z      = _mm256_mul_ps(x,x);

__m256  y      = cephes_exp_p0;
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p1);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p2);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p3);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p4);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p5);
        y      = _mm256_mul_ps(y, z);
        y      = _mm256_add_ps(y, x);
        y      = _mm256_add_ps(y, one);

  /* build 2^n */
        imm0   = _mm256_cvttps_epi32(fx);
        imm0   = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
        imm0   = _mm256_slli_epi32(imm0, 23);
__m256  pow2n  = _mm256_castsi256_ps(imm0);
        y      = _mm256_mul_ps(y, pow2n);
        return y;
}

int main(){
    int i;
    float xv[8];
    float yv[8];
    __m256 x = _mm256_setr_ps(1.0f, -2.0f, 3.0f, -4.0f, 5.0f, -6.0f, 7.0f, -8.0f);
    __m256 y = exp256_ps(x);
    _mm256_storeu_ps(xv,x);   /* unaligned store: xv and yv are not guaranteed to be 32-byte aligned */
    _mm256_storeu_ps(yv,y);

 /* compare exp256_ps with the double precision exp from math.h, 
    print the relative error             */
    printf("i      x                     y = exp256_ps(x)      double precision exp        relative error\n\n");
    for (i=0;i<8;i++){ 
        printf("i = %i  x =%16.9e   y =%16.9e   exp_dbl =%16.9e   rel_err =%16.9e\n",
           i,xv[i],yv[i],exp((double)(xv[i])),
           ((double)(yv[i])-exp((double)(xv[i])))/exp((double)(xv[i])) );
    }
    return 0;
}

The next table gives an impression of the accuracy at a few points, comparing exp256_ps with the double precision exp from math.h. The relative error is in the last column.

i      x                     y = exp256_ps(x)      double precision exp        relative error

i = 0  x = 1.000000000e+00   y = 2.718281746e+00   exp_dbl = 2.718281828e+00   rel_err =-3.036785947e-08
i = 1  x =-2.000000000e+00   y = 1.353352815e-01   exp_dbl = 1.353352832e-01   rel_err =-1.289636419e-08
i = 2  x = 3.000000000e+00   y = 2.008553696e+01   exp_dbl = 2.008553692e+01   rel_err = 1.672817689e-09
i = 3  x =-4.000000000e+00   y = 1.831563935e-02   exp_dbl = 1.831563889e-02   rel_err = 2.501162103e-08
i = 4  x = 5.000000000e+00   y = 1.484131622e+02   exp_dbl = 1.484131591e+02   rel_err = 2.108215155e-08
i = 5  x =-6.000000000e+00   y = 2.478752285e-03   exp_dbl = 2.478752177e-03   rel_err = 4.380257261e-08
i = 6  x = 7.000000000e+00   y = 1.096633179e+03   exp_dbl = 1.096633158e+03   rel_err = 1.849522682e-08
i = 7  x =-8.000000000e+00   y = 3.354626242e-04   exp_dbl = 3.354626279e-04   rel_err =-1.101575118e-08

Since fast computation of exp() requires manipulation of the exponent field of IEEE-754 floating-point operands, AVX is not really suitable for this computation because it lacks integer operations. I will therefore focus on AVX2. Support for fused multiply-add is technically a feature separate from AVX2, so I provide two code paths, with and without FMA, controlled by the macro USE_FMA.

The code below computes exp() to close to the desired accuracy of 1e-6. Use of FMA does not provide a significant improvement here, but it should provide a performance advantage on platforms that support it.

The algorithm used in a previous answer for a lower-precision SSE implementation does not fully carry over to a fairly accurate implementation, as it contains some computations with poor numerical properties that, however, do not matter in that context. Instead of computing e^x = 2^i * 2^f, with f in [0, 1] or f in [-½, ½], it is advantageous to compute e^x = 2^i * e^f with f in the narrower interval [-½ log 2, ½ log 2], where log denotes the natural logarithm.

To do this, we first compute i = rint(x * log2(e)), and then f = x - log(2) * i. Importantly, the latter computation needs to employ higher than native precision to deliver an accurate reduced argument to the core approximation. For this we use the Cody-Waite scheme, first published in W. J. Cody & W. Waite, "Software Manual for the Elementary Functions", Prentice Hall 1980. The constant log(2) is split into a "high" part of larger magnitude and a "low" part of much smaller magnitude, which holds the difference between the "high" part and the mathematical constant.

The "high" part is chosen with enough trailing zero bits in its mantissa that the product of i with the "high" part is exactly representable in native precision. Here I chose a "high" part with eight trailing zero bits, since i will certainly fit into eight bits.
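
As an illustration of how such a split can be produced (a hypothetical helper, not part of the answer), clearing the low eight mantissa bits of the single-precision value of log(2) and computing the remainder in double precision reproduces the magnitudes of the l2h and l2l constants used in the code below:

#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: Cody-Waite split of log(2) with eight trailing
   zero bits in the "high" part. */
static void split_log2(float *hi, float *lo)
{
    float l2 = (float)M_LN2;             /* log(2) rounded to float */
    uint32_t bits;
    memcpy(&bits, &l2, sizeof bits);
    bits &= ~(uint32_t)0xFF;             /* clear the low 8 mantissa bits */
    memcpy(hi, &bits, sizeof bits);
    *lo = (float)(M_LN2 - (double)*hi);  /* remainder, computed in double */
}
/* gives hi ~ 6.93145752e-1 and lo ~ 1.4286068e-6 */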

In essence, we compute f = x - i * log(2)_high - i * log(2)_low. This reduced argument is passed to the core approximation, a polynomial minimax approximation, and the result is scaled by 2^i as in the previous answer.

#include <immintrin.h>

#define USE_FMA 0

/* compute exp(x) for x in [-87.33654f, 88.72283] 
   maximum relative error: 3.1575e-6 (USE_FMA = 0); 3.1533e-6 (USE_FMA = 1)
*/
__m256 faster_more_accurate_exp_avx2 (__m256 x)
{
    __m256 t, f, p, r;
    __m256i i, j;

    const __m256 l2e = _mm256_set1_ps (1.442695041f); /* log2(e) */
    const __m256 l2h = _mm256_set1_ps (-6.93145752e-1f); /* -log(2)_hi */
    const __m256 l2l = _mm256_set1_ps (-1.42860677e-6f); /* -log(2)_lo */
    /* coefficients for core approximation to exp() in [-log(2)/2, log(2)/2] */
    const __m256 c0 =  _mm256_set1_ps (0.041944388f);
    const __m256 c1 =  _mm256_set1_ps (0.168006673f);
    const __m256 c2 =  _mm256_set1_ps (0.499999940f);
    const __m256 c3 =  _mm256_set1_ps (0.999956906f);
    const __m256 c4 =  _mm256_set1_ps (0.999999642f);

    /* exp(x) = 2^i * e^f; i = rint (log2(e) * x), f = x - log(2) * i */
    t = _mm256_mul_ps (x, l2e);      /* t = log2(e) * x */
    r = _mm256_round_ps (t, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC); /* r = rint (t) */

#if USE_FMA
    f = _mm256_fmadd_ps (r, l2h, x); /* x - log(2)_hi * r */
    f = _mm256_fmadd_ps (r, l2l, f); /* f = x - log(2)_hi * r - log(2)_lo * r */
#else // USE_FMA
    p = _mm256_mul_ps (r, l2h);      /* log(2)_hi * r */
    f = _mm256_add_ps (x, p);        /* x - log(2)_hi * r */
    p = _mm256_mul_ps (r, l2l);      /* log(2)_lo * r */
    f = _mm256_add_ps (f, p);        /* f = x - log(2)_hi * r - log(2)_lo * r */
#endif // USE_FMA

    i = _mm256_cvtps_epi32(t);       /* i = (int)rint(t) */

    /* p ~= exp (f), -log(2)/2 <= f <= log(2)/2 */
    p = c0;                          /* c0 */
#if USE_FMA
    p = _mm256_fmadd_ps (p, f, c1);  /* c0*f+c1 */
    p = _mm256_fmadd_ps (p, f, c2);  /* (c0*f+c1)*f+c2 */
    p = _mm256_fmadd_ps (p, f, c3);  /* ((c0*f+c1)*f+c2)*f+c3 */
    p = _mm256_fmadd_ps (p, f, c4);  /* (((c0*f+c1)*f+c2)*f+c3)*f+c4 ~= exp(f) */
#else // USE_FMA
    p = _mm256_mul_ps (p, f);        /* c0*f */
    p = _mm256_add_ps (p, c1);       /* c0*f+c1 */
    p = _mm256_mul_ps (p, f);        /* (c0*f+c1)*f */
    p = _mm256_add_ps (p, c2);       /* (c0*f+c1)*f+c2 */
    p = _mm256_mul_ps (p, f);        /* ((c0*f+c1)*f+c2)*f */
    p = _mm256_add_ps (p, c3);       /* ((c0*f+c1)*f+c2)*f+c3 */
    p = _mm256_mul_ps (p, f);        /* (((c0*f+c1)*f+c2)*f+c3)*f */
    p = _mm256_add_ps (p, c4);       /* (((c0*f+c1)*f+c2)*f+c3)*f+c4 ~= exp(f) */
#endif // USE_FMA

    /* exp(x) = 2^i * p */
    j = _mm256_slli_epi32 (i, 23); /* i << 23 */
    r = _mm256_castsi256_ps (_mm256_add_epi32 (j, _mm256_castps_si256 (p))); /* r = p * 2^i */

    return r;
}
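
A possible way to exercise this function (test scaffolding of this edit, not part of the answer) is to compare one vector of results against the scalar expf() from math.h:

#include <stdio.h>
#include <math.h>

int main (void)
{
    float in[8]  = {-4.0f, -2.0f, -0.5f, 0.0f, 0.5f, 2.0f, 4.0f, 80.0f};
    float out[8];
    __m256 x = _mm256_loadu_ps (in);
    _mm256_storeu_ps (out, faster_more_accurate_exp_avx2 (x));
    for (int i = 0; i < 8; i++) {
        printf ("x = %12.5e  approx = %15.8e  expf = %15.8e\n",
                in[i], out[i], expf (in[i]));
    }
    return 0;
}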

If higher accuracy is required, the degree of the polynomial approximation can be bumped up by one, using the following set of coefficients:

/* maximum relative error: 1.7428e-7 (USE_FMA = 0); 1.6586e-7 (USE_FMA = 1) */
const __m256 c0 =  _mm256_set1_ps (0.008301110f);
const __m256 c1 =  _mm256_set1_ps (0.041906696f);
const __m256 c2 =  _mm256_set1_ps (0.166674897f);
const __m256 c3 =  _mm256_set1_ps (0.499990642f);
const __m256 c4 =  _mm256_set1_ps (0.999999762f);
const __m256 c5 =  _mm256_set1_ps (1.000000000f);
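
With these six coefficients the core polynomial simply gains one more Horner step; a sketch of the FMA path (variable names follow the listing above):

/* degree-5 core approximation, assuming the coefficients c0..c5 above */
p = c0;                          /* c0 */
p = _mm256_fmadd_ps (p, f, c1);  /* c0*f+c1 */
p = _mm256_fmadd_ps (p, f, c2);  /* (c0*f+c1)*f+c2 */
p = _mm256_fmadd_ps (p, f, c3);
p = _mm256_fmadd_ps (p, f, c4);
p = _mm256_fmadd_ps (p, f, c5);  /* one extra step versus the degree-4 version */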

I played a lot with this and discovered this one, which has a relative accuracy of about ~1e-07 and is easy to convert to vector instructions. With only 4 constants, 5 multiplications and 1 division, it is about twice as fast as the built-in exp() function.

float fast_exp(float x)
{
    const float c1 = 0.007972914726F;
    const float c2 = 0.1385283768F;
    const float c3 = 2.885390043F;
    const float c4 = 1.442695022F;      
    x *= c4; //convert to 2^(x)
    int intPart = (int)x;
    x -= intPart;
    float xx = x * x;
    float a = x + c1 * xx * x;
    float b = c3 + c2 * xx;
    float res = (b + a) / (b - a);
    reinterpret_cast<int &>(res) += intPart << 23; // res *= 2^(intPart)
    return res;
}

Converted to AVX (updated)

__m256 _mm256_exp_ps(__m256 _x)
{
    __m256 c1 = _mm256_set1_ps(0.007972914726F);
    __m256 c2 = _mm256_set1_ps(0.1385283768F);
    __m256 c3 = _mm256_set1_ps(2.885390043F);
    __m256 c4 = _mm256_set1_ps(1.442695022F);
    __m256 x = _mm256_mul_ps(_x, c4); //convert to 2^(x)
    __m256 intPartf = _mm256_round_ps(x, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
    x = _mm256_sub_ps(x, intPartf);
    __m256 xx = _mm256_mul_ps(x, x);
    __m256 a = _mm256_add_ps(x, _mm256_mul_ps(c1, _mm256_mul_ps(xx, x))); //can be improved with FMA
    __m256 b = _mm256_add_ps(c3, _mm256_mul_ps(c2, xx));
    __m256 res = _mm256_div_ps(_mm256_add_ps(b, a), _mm256_sub_ps(b, a));
    __m256i intPart = _mm256_cvtps_epi32(intPartf); //res = 2^intPart. Can be improved with AVX2!
    __m128i ii0 = _mm_slli_epi32(_mm256_castsi256_si128(intPart), 23);
    __m128i ii1 = _mm_slli_epi32(_mm256_extractf128_si256(intPart, 1), 23);     
    __m128i res_0 = _mm_add_epi32(ii0, _mm256_castsi256_si128(_mm256_castps_si256(res)));
    __m128i res_1 = _mm_add_epi32(ii1, _mm256_extractf128_si256(_mm256_castps_si256(res), 1));
    return _mm256_insertf128_ps(_mm256_castsi256_ps(_mm256_castsi128_si256(res_0)), _mm_castsi128_ps(res_1), 1);
}
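
If AVX2 is available, the 128-bit lane splitting at the end (flagged as improvable in the comment above) can be replaced by full-width integer operations; a sketch of that tail, reusing the same variable names:

/* AVX2 sketch: scale res by 2^intPart with 256-bit integer ops,
   replacing the two 128-bit halves above */
__m256i intPart = _mm256_cvtps_epi32(intPartf);
__m256i scaled  = _mm256_add_epi32(_mm256_slli_epi32(intPart, 23),
                                   _mm256_castps_si256(res));  /* add intPart to the exponent field */
return _mm256_castsi256_ps(scaled);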

For normalized inputs ([-1, 1]) you can use a polynomial approximation:

    // compute Simd exp() at a time (only optimized for Type=float)
    template<typename Type, int Simd>
    inline
    void expFast(float * const __restrict__ data, float * const __restrict__ result) noexcept
    {

        alignas(64)
        Type resultData[Simd];

        
        for(int i=0;i<Simd;i++)
        {
            resultData[i] =    Type(0.0001972591916103993980868836)*data[i] + Type(0.001433947376170863208244555);
        }

        
        for(int i=0;i<Simd;i++)
        {
            resultData[i] =    resultData[i]*data[i] + Type(0.008338950118885968265658448);
        }

        
        for(int i=0;i<Simd;i++)
        {
            resultData[i] =    resultData[i]*data[i] + Type(0.04164162895364054151059463);
        }

        
        for(int i=0;i<Simd;i++)
        {
            resultData[i] =    resultData[i]*data[i] + Type(0.1666645212581130408580066);
        }

        
        for(int i=0;i<Simd;i++)
        {
            resultData[i] =    resultData[i]*data[i] + Type(0.5000045184212300597437206);
        }

        
        for(int i=0;i<Simd;i++)
        {
            resultData[i] =    resultData[i]*data[i] + Type(0.9999999756072401879691824);
        }

        
        for(int i=0;i<Simd;i++)
        {
            result[i] =    resultData[i]*data[i] + Type(0.999999818912344906607359);
        }

    }

It has an average error of 0.5 ULPs and a maximum error of 10 ULPs for 64 million points picked between -1 and 1. It gives about a 10x speedup over std::exp on AVX1 (Bulldozer).

I think you could combine this function with integer multiplication to support all powers. But the plain multiplication part needs to be O(log N) rather than O(N) to be fast enough for large powers. For example, once you have computed x^10, a single extra operation on that result gives x^20, instead of ten extra multiplications by x.
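
For reference, the O(log N) scheme referred to here is ordinary exponentiation by squaring; a minimal scalar sketch (not from the original answer):

/* integer power by repeated squaring: O(log n) multiplications instead of O(n) */
static float powi(float x, unsigned n)
{
    float r = 1.0f;
    while (n) {
        if (n & 1u) r *= x;  /* multiply in the current bit of the exponent */
        x *= x;              /* square the base for the next bit */
        n >>= 1;
    }
    return r;
}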

When the expFast function above is used in a loop, the compiler generates:

.L2:
    vmovaps zmm1, ZMMWORD PTR [rax]
    add     rax, 64
    vmovaps zmm0, zmm1
    vfmadd132ps     zmm0, zmm8, zmm9
    vfmadd132ps     zmm0, zmm7, zmm1
    vfmadd132ps     zmm0, zmm6, zmm1
    vfmadd132ps     zmm0, zmm5, zmm1
    vfmadd132ps     zmm0, zmm4, zmm1
    vfmadd132ps     zmm0, zmm3, zmm1
    vfmadd132ps     zmm0, zmm2, zmm1
    vmovaps ZMMWORD PTR [rax-64], zmm0
    cmp     rax, rdx
    jne     .L2

I think it is fast enough to leave some spare cycles for handling integer-power inputs, possibly up to the limits of float (around 10^38).