Changing one part of the code affects another part when compiling with the use_fast_math flag

Question

I have the following kernel:

__global__ void kernel()
{
  float loc{269.0f};
  float s{(356.0f - loc) / 13.05f};

  float a{pow(1.0f - 0.15f * s, 1.0f)};
  float b{pow(1.0f - 0.15f * (356.0f - loc) / 13.05f, 1.0f)};

  printf("%f\n", a);
  printf("%f\n", b);
}

It prints:

0.000000
0.000000

But if I change the way b is computed:

__global__ void kernel()
{
  float loc{269.0f};
  float s{(356.0f - loc) / 13.05f};

  float a{pow(1.0f - 0.15f * s, 1.0f)};
  float b{pow(1.0f - 0.15f * ((356.0f - loc) / 13.05f), 1.0f)}; // notice the added parentheses

  printf("%f\n", a);
  printf("%f\n", b);
}

It prints:

nan
nan

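Why does adding parentheses in the computation of b also change a? And why do the parentheses have any effect at all? Looking at the code on godbolt.org, I can see that the generated assembly is different, but I don't have enough knowledge to understand what exactly causes this behavior.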

Here is my project configuration:

set_target_properties(test PROPERTIES
    CXX_STANDARD 14
    CXX_STANDARD_REQUIRED YES
    CXX_EXTENSIONS NO
    CUDA_STANDARD 14
    CUDA_STANDARD_REQUIRED YES
    CUDA_EXTENSIONS NO
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_ARCHITECTURES "61"
)

set(CUDA_FLAGS --use_fast_math)
set(CXX_FLAGS -O0)

Note the flag --use_fast_math: without it, everything works fine. My GPU is a Quadro P1000. Cuda compilation tools, release 11.2, V11.2.152.

Answer 1

Your code is numerically unstable: it contains a catastrophic cancellation. Indeed, 1.0f - 0.15f * s and 1.0f - 0.15f * ((356.0f - loc) / 13.05f) are both nearly equal to 0, since (356 - 269) / 13.05 is about 6.667 and 0.15 times that is about 1. The computed value can be off by a few units in the last place (ULP), so depending on the rounding it can be positive, negative, or exactly 0. When the base is negative, the fast-math version of pow does not give a meaningful result (standard pow would still handle an integer-valued exponent such as 1.0f).

The rounding depends on the compiler's heuristics because you explicitly enabled fast-math, which disables some IEEE-754 rules, such as respecting the non-associativity of floating-point operations and accounting for the possible presence of NaN values. With fast-math, the compiler is free to use approximations such as a (less accurate) reciprocal instead of a true division, and that is in fact what it does in your case. Consequently, the results are undefined and the compiler is free to produce either 0 or NaN depending on its heuristics. I think that having the pow exponent set to 1.0f (which is exactly representable) does not save you from this behaviour. Note also that you should use powf for float values; pow is meant for double values.
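
To illustrate why a slightly negative base can turn into NaN, here is a minimal host-side C++ sketch. It assumes, as the CUDA documentation describes for the __powf intrinsic that --use_fast_math substitutes for powf, that the fast power is evaluated as exp2(y * log2(x)). The names fast_pow_emulated and check_fast_pow_emulation are hypothetical; the code only mimics that formula on the CPU and is not the actual device implementation.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a fast-math power function (assumption:
// the fast path computes x^y as exp2(y * log2(x))).
// The formula is only meaningful for x > 0.
inline float fast_pow_emulated(float x, float y) {
    return std::exp2(y * std::log2(x));
}

// Small self-check that can be called from main().
inline void check_fast_pow_emulation() {
    const float tiny_negative = -1e-7f;  // a base a few ULP below zero

    // log2 of a negative number is NaN, and NaN propagates through exp2.
    assert(std::isnan(fast_pow_emulated(tiny_negative, 1.0f)));

    // Standard pow handles a negative base with an integer-valued exponent.
    assert(!std::isnan(std::pow(tiny_negative, 1.0f)));
    assert(std::pow(tiny_negative, 1.0f) < 0.0f);

    // A base of exactly zero goes through log2(0) = -inf and exp2(-inf) = 0.
    assert(fast_pow_emulated(0.0f, 1.0f) == 0.0f);
}
```

Under this assumption, a base that rounds to a tiny negative number gives nan, while a base that rounds to zero or to a tiny positive number prints as 0.000000. A change of a single ULP in the base is therefore enough to switch between the two outputs you observed.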

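The right way to solve this is to use a numerically stable computation that does not contain any catastrophic cancellation like the one in your code. The actual solution depends heavily on the overall computation. Sometimes a completely different algorithm is needed (e.g. QR factorization instead of LU factorization, or a two-pass variance computation instead of a one-pass one).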
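
If the surrounding computation cannot be restructured, one possible stopgap is to clamp the base before raising it to a power. The sketch below is host-side C++ and rests on an assumption that is not stated in the question: that 1 - 0.15 * s is mathematically guaranteed to be non-negative, so any negative value is pure rounding noise. The names clamped_base, safe_pow and check_safe_pow are hypothetical. This does not remove the cancellation itself; it only keeps the rounding error from reaching the power function as a negative base.

```cpp
#include <cassert>
#include <cmath>

// Assumption: 1 - 0.15 * s is never negative in exact arithmetic,
// so a negative float result is rounding noise and is clamped to zero.
inline float clamped_base(float s) {
    const float base = 1.0f - 0.15f * s;
    // std::fmax returns the non-NaN operand if one argument is NaN.
    return std::fmax(base, 0.0f);
}

// Power of the clamped base; the base passed to pow is always >= 0.
inline float safe_pow(float s, float exponent) {
    return std::pow(clamped_base(s), exponent);
}

// Small self-check that can be called from main().
inline void check_safe_pow() {
    const float s = (356.0f - 269.0f) / 13.05f;  // values from the question

    // Whatever the rounding does, the result is a finite non-negative number.
    assert(!std::isnan(safe_pow(s, 1.0f)));
    assert(safe_pow(s, 1.0f) >= 0.0f);

    // A clearly negative base (1 - 1.5) is clamped to zero.
    assert(clamped_base(10.0f) == 0.0f);
    // A base far from the cancellation region is left untouched.
    assert(clamped_base(0.0f) == 1.0f);
}
```

Whether clamping is acceptable depends on what the value means in your model. If small differences near zero matter, the formula itself has to be reworked.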

cuda