Can the floating-point status flag FE_UNDERFLOW be set when the result is not sub-normal?

Question

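While investigating floating-point exception status flags, I came across the curious case of the status flag FE_UNDERFLOW being set when I did not expect it.

This is similar to a related question, but goes into a corner case that may be a C specification issue or an FP hardware defect.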

// pseudo code
//                       s bias_expo implied "mantissa"
w = smallest_normal;  // 0 000...001 (1)     000...000
x = w * 2;            // 0 000...010 (1)     000...000
y = next_smaller(x);  // 0 000...001 (1)     111...111
round_mode(FE_TONEAREST);
clear_status_flags();
z = y/2;              // 0 000...001 (1)     000...000

FE_UNDERFLOW is set!?

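I did not expect FE_UNDERFLOW to be set, because z above is normal, not sub-normal.
I expect FE_UNDERFLOW when the result of a floating-point operation is sub-normal and loses precision. In this case there is a loss of precision, but the result is normal.

I tried this with float and with long double and got the same result.

After much investigation, I noticed that __STDC_IEC_559__ is not defined.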

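Questions

1. If __STDC_IEC_559__ is defined, what is the correct state of underflow in this case?
2. With lack of a defined __STDC_IEC_559__, am I stuck with "Implementations that do not define __STDC_IEC_559__ are not required to conform to these specifications." (C11), or is there some C specification that indicates this result is incorrect?
3. Since this is certainly a result of my hardware (processor), your result may differ, and that would be interesting to know.

Following is some test code that demonstrates this. At first I suspected the cause was FLT_EVAL_METHOD = 2 on my machine, but then I tried similar code with long double and got the same result.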

// These 2 includes missing in original post, yet in my true test code
#include <float.h>
#include <math.h>

#include <fenv.h>
#include <stdio.h>
#include <stdint.h>

#define N (sizeof excepts/sizeof excepts[0])
void Report_IEC_FP_exception_status_flags(const char *s) {
  printf("%s", s);
  int excepts[] = { //
      FE_DIVBYZERO, FE_INEXACT, FE_INVALID, FE_OVERFLOW, FE_UNDERFLOW, };
  const char *excepts_str[N] = { //
      "FE_DIVBYZERO", "FE_INEXACT", "FE_INVALID", "FE_OVERFLOW", "FE_UNDERFLOW", };
  int excepts_val[N];

  for (unsigned i = 0; i < N; i++) {
    excepts_val[i] = fetestexcept(excepts[i]);
  }
  for (unsigned i = 0; i < N; i++) {
    if (excepts_val[i]) printf(" %s", excepts_str[i]);
  }
  printf("\n");
  fflush(stdout);
}
#undef N

void test2(float f, int round_mode, const char *name) {
  union {
    float f;
    uint32_t u32;
    } x = { .f = f};

  printf("x:%+17a %08lX normal:%c round_mode:%d %s\n", //
  f, (unsigned long) x.u32, isnormal(f) ? 'Y' : 'n', round_mode, name);
  if (feclearexcept(FE_ALL_EXCEPT)) puts("Clear Fail");
  Report_IEC_FP_exception_status_flags("Before:");

  f /= 2;

  Report_IEC_FP_exception_status_flags("After :");
  printf("y:%+17a %08lX normal:%c\n\n", 
      f,(unsigned long) x.u32, isnormal(f) ? 'Y' : 'n');
}

Driver

// In same file as above
int main(void) {
  #ifdef __STDC_IEC_559__
    printf("__STDC_IEC_559__ = %d\n", __STDC_IEC_559__);
  #else
    printf("__STDC_IEC_559__ = not define\n");
  #endif

  float f = FLT_MIN;
  printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
  printf("FLT_MIN:%+17a\n", f);
  f *= 2.0f;
  test2(f, FE_TONEAREST, "FE_TONEAREST");
  f = nextafterf(f, 0);
  test2(f, FE_TONEAREST, "FE_TONEAREST");   // *** problem? ***
  f = nextafterf(f, 0);
  test2(f, FE_TONEAREST, "FE_TONEAREST");
}

Output

__STDC_IEC_559__ = not define
FLT_EVAL_METHOD = 2
FLT_MIN:        +0x1p-126
x:        +0x1p-125 01000000 normal:Y round_mode:0 FE_TONEAREST
Before:
After :
y:        +0x1p-126 01000000 normal:Y

x: +0x1.fffffep-126 00FFFFFF normal:Y round_mode:0 FE_TONEAREST
Before:
After : FE_INEXACT FE_UNDERFLOW                *** Why FE_UNDERFLOW? ***
y:        +0x1p-126 00FFFFFF normal:Y          *** Result is normal  ***

x: +0x1.fffffcp-126 00FFFFFE normal:Y round_mode:0 FE_TONEAREST
Before:
After :
y: +0x1.fffffcp-127 00FFFFFE normal:n

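Reference

IEEE_754

Implementation notes:

GNU C11 (GCC) version 6.4.0 (i686-pc-cygwin) compiled by GNU C version 6.4.0, GMP version 6.1.2, MPFR version 3.1.5-p10, MPC version 1.0.3, isl version 0.14 or 0.13

glibc 2.26 released.

Intel Xeon W3530, 64-bit OS (Windows 7)

[Minor update] The illustrative printing of the quotient as a 32-bit hex number should have used y.u32, so the hex values on the "y:" lines of the output above are those of x, not of the quotient. This does not change the function under test.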

// printf("y:%+17a %08lX normal:%c\n\n", 
//    f,(unsigned long) x.u32, isnormal(f) ? 'Y' : 'n');
union {
  float f;
  uint32_t u32;
} y = { .f = f};
printf("y:%+17a %08lX normal:%c\n\n", 
    f,(unsigned long) y.u32, isnormal(f) ? 'Y' : 'n');
//                    ^^^^^

Answer 1

Although this was not intended as a self answer, input from various commenters and further research led to the following:

Can the floating-point status flag FE_UNDERFLOW set when the result is not sub-normal?

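Yes, in a narrow case: when the mathematical answer lies between the largest sub-normal and the smallest normal number. See below.

"Underflow" occurs when: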

The result underflows if the magnitude of the mathematical result is so small that the mathematical result cannot be represented, without extraordinary roundoff error, in an object of the specified type. C11 7.12.1 Treatment of error conditions

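z = y/2; above is 1) inexact (due to rounding) and 2) may be considered "too small".

Math

z = y/2; can be thought of as going through two stages: dividing and rounding. The mathematical quotient, with unlimited precision, is less than the smallest normal number FLT_MIN and greater than the largest sub-normal number nextafterf(FLT_MIN,0). Depending on the rounding mode, the final answer is one of those two. With FE_TONEAREST, z is assigned FLT_MIN, a normal number.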
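The "divide, then round" view can be checked with a little arithmetic. The following is a minimal sketch, not part of the original test code: the helper names exact_half_of_y and is_exact_midpoint are made up for illustration, and the hex literals assume the usual 32-bit IEEE binary32 float. It computes the quotient exactly in double and checks that it sits exactly halfway between the largest sub-normal float and FLT_MIN, which is why round-to-nearest (ties to even) lands on FLT_MIN.

```c
#include <assert.h>
#include <float.h>

/* Exact value of y/2, where y is the float just below 2*FLT_MIN.
   Hypothetical helper, for illustration only. */
double exact_half_of_y(void) {
  float y = 0x1.fffffep-126f;   /* same value as nextafterf(2*FLT_MIN, 0) */
  assert(y < 2.0f * FLT_MIN);   /* sanity check on the literal */
  return (double) y / 2.0;      /* exact: double has range and precision to spare */
}

/* Returns 1 if q lies strictly between the largest sub-normal float and
   FLT_MIN and is equidistant from both. Hypothetical helper. */
int is_exact_midpoint(double q) {
  double lo = 0x1.fffffcp-127;  /* largest sub-normal float, nextafterf(FLT_MIN, 0) */
  double hi = FLT_MIN;          /* smallest normal float */
  return lo < q && q < hi && (q - lo) == (hi - q);
}
```

Calling is_exact_midpoint(exact_half_of_y()) from a small main shows the tie: the quotient is 2^-126 - 2^-150, one half unit in the last place away from each neighbor.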

Spec

The C spec and IEC 60559 text below indicate:

The "underflow" floating-point exception is raised whenever a result is tiny (essentially subnormal or zero) and suffers loss of accuracy.³⁵⁸ C11 §F.10 7.
³⁵⁸ IEC 60559 allows different definitions of underflow. They all result in the same values, but differ on when the floating-point exception is raised.

and

Two definitions were allowed for the determination of the 'tiny' condition: before or after rounding the infinitely precise result to working precision, with unbounded exponent.

Annex U of 754r recommended that only tininess after rounding and inexact as loss of accuracy be a cause for underflow signal. wiki reference

(My emphasis)

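Q & A

1 If __STDC_IEC_559__ is defined, what is the correct state of underflow in this case?

In this case the underflow flag may be set or left alone. Either conforms. There is a preference, though, for not setting the underflow flag.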

2 With lack of a defined __STDC_IEC_559__ am I stuck with "Implementations that do not define __STDC_IEC_559__ are not required to conform to these specifications." C11 or is there some C specification that indicates this result is incorrect?

Setting the underflow flag here is not incorrect. The FP spec allows this behavior. It also allows not setting the underflow flag.

3 Since this is certainly a result of my hardware (processor), your result may differ and that would be interesting to know.

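On another platform, where __STDC_IEC_559__ = not define and FLT_EVAL_METHOD = 0, the FE_INEXACT and FE_UNDERFLOW flags were both set, just as in the test case above. The issue applies to float, double, and long double.

If the mathematical answer lies in the grey "in-between" zone (the gap between the largest sub-normal double and DBL_MIN), it will be rounded down to a sub-normal double or up to the normal double DBL_MIN, depending on its value and the rounding mode. If it is rounded down, FE_UNDERFLOW is certainly set. If it is rounded up, FE_UNDERFLOW may or may not be set, depending on when the determination of the 'tiny' condition is applied.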

c

floating-point

ieee-754