Does unrolling a loop affect the accuracy of the computations within?

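Problem summary: Does unrolling a loop affect the accuracy of the computations performed within the loop? And if so, why?

Elaboration and background: I am writing a compute shader in HLSL for use in a Unity project (2021.2.9f1). Parts of my code involve numerical procedures and highly oscillatory functions, so high computational accuracy is essential.

When comparing my results with an equivalent program in Python, I noticed deviations on the order of 1e-5. This is concerning, because I did not expect errors this large to come from precision differences such as the floating-point accuracy of trigonometric or power functions in HLSL.

After a lot of debugging, I now believe that the choice of unrolling or not unrolling a loop is the cause of the deviation. I find this strange, since I cannot find any source indicating that unrolling a loop affects accuracy in addition to the "space-time tradeoff".

To clarify: if my Python results are taken as the correct solution, unrolling the loop in HLSL gives me better results than not unrolling it.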

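Minimal working example: Below is an MWE consisting of a C# script for Unity, the corresponding compute shader where the computations are performed, and a screenshot of my console when running in Unity (2021.2.9f1). Forgive the somewhat messy implementation of Newton's method, but I chose to keep it since I believe it may be a cause of this deviation. That is, if I simply compute cos(x), there is no difference between the unrolled and the non-unrolled version. Nonetheless, I still fail to understand how simply adding [unroll(N)] in the test kernel changes the result...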

// C# for Unity
using UnityEngine;

public class UnrollTest : MonoBehaviour
{

    [SerializeField] ComputeShader CS;
    ComputeBuffer CBUnrolled, CBNotUnrolled;
    readonly int N = 3;

    private void Start()
    {

        CBUnrolled = new ComputeBuffer(N, sizeof(double));
        CBNotUnrolled = new ComputeBuffer(N, sizeof(double));

        CS.SetBuffer(0, "_CBUnrolled", CBUnrolled);
        CS.SetBuffer(0, "_CBNotUnrolled", CBNotUnrolled);

        CS.Dispatch(0, (int)((N + (64 - 1)) / 64), 1, 1);

        double[] ansUnrolled = new double[N];
        double[] ansNotUnrolled = new double[N];

        CBUnrolled.GetData(ansUnrolled);
        CBNotUnrolled.GetData(ansNotUnrolled);

        for (int i = 0; i < N; i++)
        {
            Debug.Log("Unrolled ans = " + ansUnrolled[i] + 
                "  -  Not Unrolled ans = " + ansNotUnrolled[i] +  
                "  --  Difference is: " + (ansUnrolled[i] - ansNotUnrolled[i]));
        }
        CBUnrolled.Release();
        CBNotUnrolled.Release();
    }
}

// HLSL compute shader
#pragma kernel CSMain

RWStructuredBuffer<double> _CBUnrolled, _CBNotUnrolled;

// Dummy function for Newton's method
double fDummy(double k, double fnh, double h, double theta)
{
    return fnh * fnh * k * h * cos(theta) * cos(theta) - (double) tanh(k * h);
}

// Derivative of Dummy function above using a central finite difference scheme.
double dfDummy(double k, double fnh, double h, double theta)
{
    return (fDummy(k + (double) 1e-3, fnh, h, theta) - fDummy(k - (double) 1e-3, fnh, h, theta)) / (double) 2e-3;
}

// Function to solve.
double f(double fnh, double h, double theta)
{
    // Solved using Newton's method.
    int max_iter = 50;
    double epsilon = 1e-8;
    double fxn, dfxn;

    // Define initial guess for k, hereby denoted as x.
    double xn = 10.0;

    for (int n = 0; n < max_iter; n++)
    {
        fxn = fDummy(xn, fnh, h, theta);
        
        if (abs(fxn) < epsilon)     // A solution is found.
            return xn;
        
        dfxn = dfDummy(xn, fnh, h, theta);

        if (dfxn == 0.0)    // No solution found.
            return xn;

        xn = xn - fxn / dfxn;
    }

    // No solution found.
    return xn;
}

[numthreads(64,1,1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    int N = 3;
    
    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01;   // Example values.
   
    for (int i = 0; i < N; i++)                 // Not being unrolled
    {   
        _CBNotUnrolled[i] = f(fnh, h, theta);
        theta += dtheta;
    }
    
    // ---------------
    fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01;          // Example values.

    [unroll(N)] for (int j = 0; j < N; j++)     // Being unrolled.
    {
        _CBUnrolled[j] = f(fnh, h, theta);
        theta += dtheta;
    }
}

Image of Unity console when running the above
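
For readers who want a CPU-side reference to compare against, here is a minimal Python sketch of the same Newton iteration. It is a hypothetical port of the HLSL above, not the Python program mentioned in the question (which is not shown), and the names `f_dummy`, `df_dummy` and `solve` are chosen only for illustration.

```python
import math

def f_dummy(k, fnh, h, theta):
    # Same expression as fDummy in the shader, in IEEE-754 double precision.
    return fnh * fnh * k * h * math.cos(theta) * math.cos(theta) - math.tanh(k * h)

def df_dummy(k, fnh, h, theta, dk=1e-3):
    # Central finite difference with the same step as dfDummy.
    return (f_dummy(k + dk, fnh, h, theta) - f_dummy(k - dk, fnh, h, theta)) / (2.0 * dk)

def solve(fnh, h, theta, x0=10.0, max_iter=50, epsilon=1e-8):
    # Newton's method with the same initial guess and stopping rules as f().
    xn = x0
    for _ in range(max_iter):
        fxn = f_dummy(xn, fnh, h, theta)
        if abs(fxn) < epsilon:      # A solution is found.
            return xn
        dfxn = df_dummy(xn, fnh, h, theta)
        if dfxn == 0.0:             # No solution found.
            return xn
        xn = xn - fxn / dfxn
    return xn

fnh, h, dtheta = 0.9, 4.53052, 0.01
theta = -0.161
results = []                        # (theta, root) pairs, one per loop iteration
for _ in range(3):
    results.append((theta, solve(fnh, h, theta)))
    theta += dtheta

for th, root in results:
    print(f"theta = {th:+.3f}  k = {root!r}")
```

Printing with `repr` shows all significant digits, which makes last-bit differences against the GPU output visible.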

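Edit: After some more testing, the deviation has been narrowed down to the following code, which gives a difference of about 1e-17 between the unrolled and non-unrolled versions of exactly the same code. Although the difference is small, I still consider it a valid example of the issue, since I believe the two results should be equal.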

[numthreads(64, 1, 1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    if ((int) threadID.x != 1)
        return;
    
    int N = 3;
    double k = 1.0;
    
    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
 
    for (int i = 0; i < N; i++)                 // Not being unrolled
    {
        _CBNotUnrolled[i] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
        theta += dtheta;
    }
   
    // ---------------
    fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
 
    [unroll(N)]
    for (int j = 0; j < N; j++)     // Being unrolled.
    {
        _CBUnrolled[j] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
        theta += dtheta;
    }
}

Image of Unity console when running the edited script above

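Edit 2: Below is the compiled code for the kernel given in Edit 1. Unfortunately, my experience with assembly language is limited, and I cannot tell whether this listing shows any errors or whether it is useful for the problem at hand.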

**** Platform Direct3D 11:
Compiled code for kernel CSMain
keywords: <none>
binary blob size 648:
//
// Generated by Microsoft (R) D3D Shader Disassembler
//
//
// Note: shader requires additional functionality:
//       Double-precision floating point
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Input
//
// Output signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Output
      cs_5_0
      dcl_globalFlags refactoringAllowed | enableDoublePrecisionFloatOps
      dcl_uav_structured u0, 8
      dcl_uav_structured u1, 8
      dcl_input vThreadID.x
      dcl_temps 2
      dcl_thread_group 64, 1, 1
   0: ine r0.x, vThreadID.x, l(1)
   1: if_nz r0.x
   2:   ret 
   3: endif 
   4: dmov r0.xy, d(-0.161000l, 0.000000l)
   5: mov r0.z, l(0)
   6: loop 
   7:   ige r0.w, r0.z, l(3)
   8:   breakc_nz r0.w
   9:   dmul r1.xyzw, r0.xyxy, d(1.001000l, 0.999000l)
  10:   dadd r1.xy, -r1.zwzw, r1.xyxy
  11:   store_structured u1.xy, r0.z, l(0), r1.xyxx
  12:   dadd r0.xy, r0.xyxy, d(0.010000l, 0.000000l)
  13:   iadd r0.z, r0.z, l(1)
  14: endloop 
  15: store_structured u0.xy, l(0), l(0), l(-0.000000,-0.707432,0,0)
  16: store_structured u0.xy, l(1), l(0), l(0.000000,-0.702312,0,0)
  17: store_structured u0.xy, l(2), l(0), l(-918250586112.000000,-0.697192,0,0)
  18: ret 
// Approximately 0 instruction slots used
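
One possible reading of this listing, offered as an assumption rather than something the source states: the three `store_structured u0.xy` literals look like the unrolled results already computed at compile time, with each double printed as two 32-bit words interpreted as floats (low word first), while the loop multiplies by 1.001 and 0.999 at run time. The Python sketch below checks that reading by reinterpreting the printed high words; the low words cannot be recovered from six printed digits, so they are set to zero, and the helper names are hypothetical.

```python
import struct

def float_to_bits(x):
    # 32-bit pattern of x stored as an IEEE-754 single.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def double_from_words(lo, hi):
    # Reassemble a little-endian double from its low and high 32-bit words.
    return struct.unpack("<d", struct.pack("<II", lo, hi))[0]

# Second component of each literal in instructions 15-17 of the listing.
high_words_as_floats = [-0.707432, -0.702312, -0.697192]

# Assumption: low word set to 0, since it is not recoverable from the listing.
decoded = [double_from_words(0, float_to_bits(x)) for x in high_words_as_floats]

# (k + 1e-3) * theta - (k - 1e-3) * theta equals 0.002 * theta in exact arithmetic.
expected = [0.002 * t for t in (-0.161, -0.151, -0.141)]

for d, e in zip(decoded, expected):
    print(f"decoded {d:.6e}   expected {e:.6e}")
```

If this reading is right, the unrolled loop never executes on the GPU at all, so the two buffers are filled by two different pieces of arithmetic: the compiler's and the hardware's.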

Edit 3: After getting in touch with Microsoft (see https://docs.microsoft.com/en-us/an...nrolling-a-loop-affect-the-accuracy-of-t.html), they stated that the problem has more to do with Unity. This is because

"The pragma unroll [(n)] is keil compiler which Unity uses topic"

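This depends on the driver, the hardware, the compiler, and Unity.

Essentially, the HLSL specification gives looser guarantees for the rounding behavior of mathematical operations than regular IEEE-754 floating point.

First, it is implementation-dependent whether operations round up or down.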

IEEE-754 requires floating-point operations to produce a result that is the nearest representable value to an infinitely-precise result, known as round-to-nearest-even. Direct3D 10, however, defines a looser requirement: 32-bit floating-point operations produce a result that is within one unit-last-place (1 ULP) of the infinitely-precise result. This means that, for example, hardware is allowed to truncate results to 32-bit rather than perform round-to-nearest-even, as that would result in error of at most one ULP.
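
To make the 1 ULP allowance concrete, here is a small Python sketch that rounds the same exact product to 32 bits in two ways: round-to-nearest-even (what `struct` does when packing a single) and truncation toward zero. The helper names are hypothetical, and the inputs are borrowed from the question purely as example values.

```python
import struct

def f32_bits(x):
    # Bit pattern of x after round-to-nearest-even conversion to float32.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    # float32 value (widened to a Python float) for a 32-bit pattern.
    return struct.unpack("<f", struct.pack("<I", b))[0]

def round_nearest_f32(x):
    return bits_f32(f32_bits(x))

def truncate_f32(x):
    # Round toward zero: if nearest-even rounded away from zero, step back one ULP.
    b = f32_bits(x)
    if abs(bits_f32(b)) > abs(x):
        b -= 1
    return bits_f32(b)

# Two float32 inputs; their product is exact in double (24 + 24 < 53 bits).
a = round_nearest_f32(0.9)
b = round_nearest_f32(4.53052)
exact = a * b

nearest = round_nearest_f32(exact)
truncated = truncate_f32(exact)
ulp_gap = abs(f32_bits(nearest) - f32_bits(truncated))

print(nearest, truncated, ulp_gap)
```

Both results satisfy the Direct3D rule quoted above, but they need not be bit-identical, which is all it takes for two code paths to disagree.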

Going further, the HLSL compiler itself has a number of fast-math optimizations that can violate IEEE-754 floating-point conformance; see, for example:

D3DCOMPILE_IEEE_STRICTNESS - Forces strict compile, which might not allow for legacy syntax. By default, the compiler disables strictness on deprecated syntax.
D3DCOMPILE_OPTIMIZATION_LEVEL3 - Directs the compiler to use the highest optimization level. If you set this constant, the compiler produces the best possible code but might take significantly longer to do so. Set this constant for final builds of an application when performance is the most important factor.
D3DCOMPILE_PARTIAL_PRECISION - Directs the compiler to perform all computations with partial precision. If you set this constant, the compiled code might run faster on some hardware.

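This matters for your scenario in particular because, if optimizations are enabled, loop unrolling can trigger constant-folding optimizations that reduce the computational cost of your code and change the precision of its results (potentially even improving them). Note that when constant folding occurs, the compiler has to decide how to perform rounding, and its choice may not match what your hardware FPU would do.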
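
As a rough illustration of how folding can shift the last bits, the Python sketch below evaluates the reduced expression from the question's edit in two ways: step by step in double precision, and exactly (via `fractions.Fraction`) with a single rounding at the end. The second path is only one hypothetical model of a constant folder; nothing in this thread establishes how fxc or dxc actually folds.

```python
from fractions import Fraction

def runtime_style(k, dk, theta):
    # Every intermediate result is rounded to double, as a run-time loop would do.
    return (k + dk) * theta - (k - dk) * theta

def folded_style(k, dk, theta):
    # Hypothetical folder: exact arithmetic on the same double inputs, rounded once.
    k, dk, theta = Fraction(k), Fraction(dk), Fraction(theta)
    return float((k + dk) * theta - (k - dk) * theta)

k, dk, dtheta = 1.0, 1e-3, 0.01
theta = -0.161
diffs = []
for _ in range(3):
    diffs.append(runtime_style(k, dk, theta) - folded_style(k, dk, theta))
    theta += dtheta

print(diffs)
```

For these inputs each rounding error is on the order of 1e-17, so any gap between the two paths stays at about that size, the same magnitude reported in the question.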

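Oh, and note that IEEE-754 places no bounds on the precision of "additional operations" (e.g. sin, cos, tanh, atan, ln, etc.), let alone requires that they be implemented; it merely recommends them.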
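
A quick way to see this on the CPU: two mathematically equivalent ways of computing tanh need not agree to the last bit. The sketch below compares Python's `math.tanh` with a textbook formula based on `exp`; the function `tanh_via_exp` is a hypothetical stand-in for "some other implementation" and is not how any particular GPU computes tanh.

```python
import math

def tanh_via_exp(x):
    # Textbook identity tanh(x) = (e^(2x) - 1) / (e^(2x) + 1), evaluated in double.
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)

xs = [0.1 * i for i in range(1, 50)]
max_gap = max(abs(math.tanh(x) - tanh_via_exp(x)) for x in xs)

print(max_gap)
```

Any nonzero gap here is tiny, but an iteration such as Newton's method with a 1e-8 stopping threshold can turn last-bit differences into a different iteration count and therefore a visibly different answer.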

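  • See, for instance, this very common case where that goes wrong: sin gets quantized to 4 different values on Intel integrated graphics, but has reasonable precision on other hardware: sin(x) only returns 4 different values for moderately large input on GLSL fragment shader, Intel HD4000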

Also note that Unity does not guarantee that a float in a shader is actually a 32-bit float; on some hardware (e.g. mobile), it can even be backed by a 16-bit half or an 11-bit fixed.

High precision: float Highest precision floating point value; generally 32 bits (just like float from regular programming languages).

... One complication of float/half/fixed data type usage is that PC GPUs are always high precision. That is, for all the PC (Windows/Mac/Linux) GPUs, it does not matter whether you write float, half or fixed data types in your shaders. They always compute everything in full 32-bit floating point precision.

The half and fixed types only become relevant when targeting mobile GPUs, where these types primarily exist for power (and sometimes performance) constraints. Keep in mind that you need to test your shaders on mobile to see whether or not you are running into precision/numerical issues.

Even on mobile GPUs, the different precision support varies between GPU families.
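
To get a feel for what those narrower types cost, this Python sketch stores one of the question's example values as a 16-bit half and as a 32-bit float and measures the round-trip error. It only models IEEE-style storage through the `struct` module; it says nothing about how a given mobile GPU implements half, and the 11-bit fixed type is not modelled.

```python
import struct

def round_trip(fmt, x):
    # Store x in the given binary format and read it back as a Python float.
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

h = 4.53052                      # example value taken from the question
as_half = round_trip("<e", h)    # IEEE-754 binary16
as_float = round_trip("<f", h)   # IEEE-754 binary32

err_half = abs(as_half - h)
err_float = abs(as_float - h)

print(f"half : {as_half!r}  error {err_half:.3e}")
print(f"float: {as_float!r}  error {err_float:.3e}")
```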

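I don't think Unity exposes compiler flags to the developer; you are at its whim as to which optimizations it passes to dxc/fxc. Given that it is primarily used for games, you can bet that optimizations are enabled.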

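Finally, if you want an in-depth look at this subject, check out "Floating-Point Determinism" by Bruce Dawson. I would add that the same problem exists if you want consistent results between languages (since a language may implement math functions itself instead of using hardware intrinsics, e.g. for better precision), when cross-compiling (since different compilers/backends may optimize differently or use different system libraries), or when running managed code across different runtimes (e.g. since the JIT may optimize differently).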