Subnormal numbers in different precisions with MPFR

Question

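I would like to emulate various n-bit binary floating-point formats, each with a specified e_max and e_min and with p bits of precision. I would like these formats to emulate subnormal numbers faithfully to the IEEE-754 standard.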

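Naturally, my search led me to the MPFR library, which is IEEE-754 compliant and supports subnormals through the mpfr_subnormalize() function. However, I have run into some confusion about using mpfr_set_emin() and mpfr_set_emax() to correctly set up a subnormal-enabled environment. I will use IEEE double precision as the example format, since this is the example used in the MPFR manual: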

http://mpfr.loria.fr/mpfr-current/mpfr.html#index-mpfr_005fsubnormalize

mpfr_set_default_prec (53);
mpfr_set_emin (-1073); mpfr_set_emax (1024);

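The code above is from the MPFR manual linked above. Note that neither e_max nor e_min equals the value expected for double. Here p is set to 53, as expected for the double type, but e_max is set to 1024 rather than the correct value of 1023, and e_min is set to -1073, well below the correct value of -1022. I understand that setting the exponent bounds too tightly causes overflow/underflow in MPFR's intermediate computations, but I have found that setting e_min exactly is critical for getting correct subnormal numbers: a value that is too high or too low makes a subnormal MPFR result (after it is updated with mpfr_subnormalize()) differ from the corresponding double result.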

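My question is: how should one decide which values to pass to mpfr_set_emax() and (especially) mpfr_set_emin() in order to guarantee correct subnormal behaviour for a floating-point format with exponent bounds e_max and e_min? There does not seem to be any detailed documentation or discussion on the matter.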

Many thanks,

James.

Edit, 30 July 2016: here is a small program that demonstrates the choice of e_max and e_min for single-precision numbers.

#include <iostream>
#include <cmath>
#include <float.h>
#include <mpfr.h>

using namespace std;

int main (int argc, char *argv[])
{
    cout.precision(120);

    // IEEE-754 float emin and emax values don't work at all
    //mpfr_set_emin (-126);
    //mpfr_set_emax (127);

    // Not quite
    //mpfr_set_emin (-147);
    //mpfr_set_emax (128);

    // Not quite
    //mpfr_set_emin (-149);
    //mpfr_set_emax (128);

    // These float emin and emax values work in subnormal range
    mpfr_set_emin (-148);
    mpfr_set_emax (128);

    cout << "emin: " << mpfr_get_emin() << " emax: " << mpfr_get_emax() << endl;

    float f = FLT_MIN;
    for (int i = 0; i < 3; i++)
        f = nextafterf(f, INFINITY);

    mpfr_t m;
    mpfr_init2 (m, 24);
    mpfr_set_flt (m, f, MPFR_RNDN);

    for (int i = 0; i < 6; i++)
    {
        f = nextafterf(f, 0);
        mpfr_nextbelow(m);
        cout << i << ": float: " << f << endl;
        //cout << i << ": mpfr: " << mpfr_get_flt (m, MPFR_RNDN) << endl;
        mpfr_subnormalize (m, 1, MPFR_RNDN);
        cout << i << ": mpfr: " << mpfr_get_flt (m, MPFR_RNDN) << endl;
    }

    mpfr_clear (m);
    return 0;
}

Answer 1

I am copying the answer I gave on ResearchGate (with a link to the mpfr_subnormalize documentation):

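There are different conventions for expressing significands and the associated exponents. IEEE 754 chooses to consider significands between 1 and 2, while MPFR (like the C language) chooses to consider significands between 1/2 and 1 (for practical reasons related to multiple precision). For instance, the number 17 is represented as 1.0001·2⁴ in IEEE 754 and as 0.10001·2⁵ in MPFR (significands written in binary). As you can see, this means that exponents are increased by 1 in MPFR compared to IEEE 754, hence e_max = 1024 instead of 1023 for double precision.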
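The two conventions can be observed from standard C++ alone, without MPFR. The following is a small sketch, assuming the platform's double is IEEE binary64; the function names exponent_half_to_one and exponent_one_to_two are made up for illustration. std::frexp normalizes the significand into [1/2, 1), which is the convention MPFR uses, while std::ilogb returns the exponent for a significand in [1, 2).

```cpp
#include <cfloat>  // DBL_MAX, DBL_MAX_EXP
#include <cmath>   // std::frexp, std::ilogb

// Exponent e of a finite nonzero x written as m * 2^e with 1/2 <= |m| < 1
// (the convention used by MPFR and by C's frexp).
int exponent_half_to_one(double x) {
    int e = 0;
    std::frexp(x, &e);  // the returned significand is not needed here
    return e;
}

// Exponent e of a finite nonzero x written as m * 2^e with 1 <= |m| < 2
// (the convention used by IEEE 754).
int exponent_one_to_two(double x) {
    return std::ilogb(x);
}
```

For x = 17 the first function gives 5 and the second gives 4, matching the example above. On an IEEE binary64 platform, exponent_half_to_one(DBL_MAX) equals DBL_MAX_EXP, which is 1024 rather than 1023.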

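Concerning the choice of e_min for double precision, one needs to be able to represent 2⁻¹⁰⁷⁴ = 0.1·2⁻¹⁰⁷³ (with 0.1 written in binary), so e_min needs to be at most -1073 (in MPFR, all numbers are normalized).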
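As a quick sanity check of that exponent, the smallest positive subnormal of a native type can be normalized with std::frexp. This is only a sketch: smallest_subnormal_exponent is a hypothetical helper name, and it assumes that float and double are IEEE binary32 and binary64 and that subnormals are not flushed to zero by the compiler or hardware settings.

```cpp
#include <cmath>   // std::frexp
#include <limits>  // std::numeric_limits

// Returns e such that the smallest positive subnormal of T equals 0.5 * 2^e,
// i.e. its exponent under the [1/2, 1) significand convention.
template <typename T>
int smallest_subnormal_exponent() {
    int e = 0;
    std::frexp(std::numeric_limits<T>::denorm_min(), &e);
    return e;
}
```

Under those assumptions the result is -1073 for double and -148 for float, the latter being the e_min value that the program in the question found to work for single precision.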

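As documented, the mpfr_subnormalize function considers the subnormal exponent range to run from e_min to e_min + PREC(x) − 1, so, for instance, you need to set e_min = −1073 to emulate IEEE double precision.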
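For an arbitrary format, the same reasoning can be turned into a small conversion rule. The sketch below is not part of MPFR: MpfrExponentRange and mpfr_range_for_ieee are hypothetical names, and the inputs are assumed to be the IEEE-style parameters p, e_min and e_max (significand in [1, 2)). The smallest subnormal is 2^(e_min − (p − 1)) = 0.5 · 2^(e_min − p + 2), which gives the lower bound, and the upper bound is shifted by 1 because of the significand convention.

```cpp
// Exponent bounds intended for mpfr_set_emin() and mpfr_set_emax().
// Plain long is used here so the sketch compiles without <mpfr.h>;
// MPFR itself uses the mpfr_exp_t type for these values.
struct MpfrExponentRange {
    long emin;
    long emax;
};

// p: precision in bits (including the implicit leading bit),
// ieee_emin / ieee_emax: exponent bounds in the IEEE 754 convention.
constexpr MpfrExponentRange mpfr_range_for_ieee(long p, long ieee_emin, long ieee_emax) {
    return MpfrExponentRange{ieee_emin - p + 2, ieee_emax + 1};
}

// IEEE double precision: p = 53, e_min = -1022, e_max = 1023.
static_assert(mpfr_range_for_ieee(53, -1022, 1023).emin == -1073, "double emin");
static_assert(mpfr_range_for_ieee(53, -1022, 1023).emax == 1024, "double emax");
```

With p = 24, e_min = -126 and e_max = 127 this yields -148 and 128, the single-precision values reported as working in the question. The resulting pair would then be passed to mpfr_set_emin() and mpfr_set_emax() before calling mpfr_subnormalize().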

c

c++

floating-point

mpfr

multiprecision