IEEE 754 floating point math operations

Question

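So, as I understand it, floating-point math is derived from school math. For multiplication and division, you can add or subtract the exponents after the calculation. While doing my code design (on paper), I ran into a few problems, listed below:

Addition and/or subtraction...

How do you handle the case where the base and exponents are not the same?
What do you do if the difference in exponents is greater than the size of a double integer data type?

I found a few things online, but nothing that really stated how to handle this. From school math, you have to normalize the values first before you can run any calculations on them.

So...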

2^3 + 3^2 = 8 + 9 = 17

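Is the same thing needed here?

Edit: I apologize to the community, as I thought the question was quite specific. This is using powers of 2, because the current platform is IA32. I do not know of any platform that uses decimal floating point. I used decimal as an example.

Mark B answered the first question: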

Thankfully, floats are all done with powers of 2, so just normalize the exponent. e.g. using powers of 10 scientific notation.

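So now the second question (as listed above) is: what do you do when, in order to normalize the values, you have to shift by more than there is space in the data type? In other words, if I have, say, 32 bits of precision and I have to shift by, say, 35 bits to make the exponents match, how do you handle that situation? How does the FPU handle it?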

Answer 1

Thankfully, floats are all done with powers of 2, so just normalize the exponent. For example, using powers of 10 in scientific notation:

   3.1e5        0.031e7
+ 2.96e7  ->  + 2.96 e7  
--------        -------
                2.991e7
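The same alignment step in base 2 can be sketched as below, including the case from the question where the required shift is wider than the significand. This is only a toy model under stated assumptions: `toy_float`, `toy_add` and `TOY_PRECISION` are hypothetical names, the operands are non-negative, and the low bits are truncated rather than rounded. A real FPU typically keeps a few extra bits (guard, round, sticky) so the result can be rounded correctly, but in round-to-nearest the outcome for a very large exponent gap is the same: the smaller operand is absorbed.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

#define TOY_PRECISION 24   /* significand bits, as in binary32 */

/* Hypothetical toy format, not a real FPU layout: value = c * 2^q. */
typedef struct {
    uint32_t c;   /* integer significand, below 2^TOY_PRECISION */
    int      q;   /* exponent */
} toy_float;

/* Right shift that yields 0 once every bit has been shifted out.
   In C, shifting a 32-bit value by 32 or more is undefined behaviour. */
static uint32_t shift_right_clamped(uint32_t c, int n)
{
    return (n >= 32) ? 0u : (c >> n);
}

/* Add two non-negative toy floats, truncating instead of rounding.
   Assumes nonzero inputs are normalized (bit TOY_PRECISION-1 of c set). */
toy_float toy_add(toy_float a, toy_float b)
{
    assert(a.c < (1u << TOY_PRECISION) && b.c < (1u << TOY_PRECISION));
    if (a.c == 0) return b;
    if (b.c == 0) return a;

    /* Make a the operand with the larger exponent. */
    if (a.q < b.q) { toy_float t = a; a = b; b = t; }

    /* Align b to a's exponent; its low bits fall off the right end. */
    toy_float r;
    r.c = a.c + shift_right_clamped(b.c, a.q - b.q);
    r.q = a.q;

    /* Renormalize if the sum carried out of TOY_PRECISION bits. */
    if (r.c >> TOY_PRECISION) {
        r.c >>= 1;
        r.q += 1;
    }
    return r;
}

/* Convert to double for inspection. */
double toy_value(toy_float x)
{
    return ldexp((double)x.c, x.q);
}
```

With this sketch, adding `{0x800000, 0}` and `{0x800000, -35}` returns the first operand unchanged: a 35-bit shift pushes every bit of the smaller value out of the 24-bit significand, so nothing is left to add.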

Answer 2

IEEE 754 floating point is just one implementation of floating-point numbers. As usual, Wikipedia has a decent reference.

You pick a base (generally 2, but IEEE 754 also defines base 10), and then a real number is represented as f = sign * significand * base^exponent, where significand and exponent are integers and sign is 1 or -1. Specifically, you have:

Finite numbers, which may be either base 2 (binary) or base 10 (decimal). Each finite number is described by three integers: s = a sign (zero or one), c = a significand (or 'coefficient'), q = an exponent. The numerical value of a finite number is
(−1)^s × c × b^q
where b is the base (2 or 10), also called radix. For example, if the base is 10, the sign is 1 (indicating negative), the significand is 12345, and the exponent is −3, then the value of the number is (−1)^1 × 12345 × 10^−3 = −1 × 12345 × .001 = −12.345.

Two infinities: +∞ and −∞.

Two kinds of NaN: a quiet NaN (qNaN) and a signaling NaN (sNaN). A NaN may carry a payload that is intended for diagnostic information indicating the source of the NaN. The sign of a NaN has no meaning, but it may be predictable in some circumstances.
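To make the three integers s, c and q concrete for base 2, here is a small sketch that splits a finite `double` into them and rebuilds it. The `scq` type and the `decompose`/`compose` helpers are hypothetical names for illustration; only `frexp`, `ldexp`, `fabs`, `signbit` and `isfinite` come from the standard `<math.h>`. It assumes `double` is IEEE 754 binary64 (p = 53), and the c it returns is not reduced to its smallest form.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical helper type: the integers s, c, q for base b = 2. */
typedef struct {
    int      s;   /* sign: 0 or 1 */
    uint64_t c;   /* integer significand */
    int      q;   /* exponent */
} scq;

/* Split a finite double so that x == (-1)^s * c * 2^q. */
scq decompose(double x)
{
    assert(isfinite(x));
    scq r;
    r.s = signbit(x) ? 1 : 0;

    int e;
    double m = frexp(fabs(x), &e);   /* |x| = m * 2^e with 0.5 <= m < 1, or m == 0 */

    /* Scale the fraction up to an integer of at most 53 bits. */
    r.c = (uint64_t)ldexp(m, 53);
    r.q = (x == 0.0) ? 0 : e - 53;
    return r;
}

/* Rebuild the value from the three integers. */
double compose(scq v)
{
    double magnitude = ldexp((double)v.c, v.q);
    return v.s ? -magnitude : magnitude;
}
```

For instance, `decompose(1.0)` gives s = 0, c = 2^52 and q = −52, and `compose(decompose(x))` returns `x` for any finite `x`.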

The possible finite values that can be represented in a format are determined by the base b, the number of digits in the significand (precision p), and the exponent parameter emax:

c must be an integer in the range zero through b^p − 1 (e.g., if b=10 and p=7 then c is 0 through 9999999)

q must be an integer such that 1−emax ≤ q+p−1 ≤ emax (e.g., if p=7 and emax=96 then q is −101 through 90).

Hence (for the example parameters) the smallest non-zero positive number that can be represented is 1×10^−101 and the largest is 9999999×10^90 (9.999999×10^96), and the full range of numbers is −9.999999×10^96 through 9.999999×10^96. The numbers −b^(1−emax) and b^(1−emax) (here, −1×10^−95 and 1×10^−95) are the smallest (in magnitude) normal numbers; non-zero numbers between these smallest numbers are called subnormal numbers.
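The exponent bounds above follow directly from 1−emax ≤ q+p−1 ≤ emax, and the arithmetic can be checked with a short sketch. The names `q_range`, `exponent_range`, `largest_binary` and `smallest_normal_binary` are hypothetical. The binary32 parameters p = 24, emax = 127 and the binary64 parameters p = 53, emax = 1023 are my own added examples, not part of the quoted text, and the comparison with `<float.h>` assumes `float` and `double` are those two formats.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Exponent limits for the integer-significand form c * b^q,
   derived from 1 - emax <= q + p - 1 <= emax. */
typedef struct {
    int qmin;
    int qmax;
} q_range;

q_range exponent_range(int p, int emax)
{
    assert(p >= 1 && emax >= 1);
    q_range r;
    r.qmin = (1 - emax) - (p - 1);
    r.qmax = emax - (p - 1);
    return r;
}

/* Largest finite value for base 2: (2^p - 1) * 2^qmax. */
double largest_binary(int p, int emax)
{
    q_range r = exponent_range(p, emax);
    return ldexp(ldexp(1.0, p) - 1.0, r.qmax);
}

/* Smallest positive normal value for base 2: 2^(1 - emax). */
double smallest_normal_binary(int emax)
{
    return ldexp(1.0, 1 - emax);
}
```

`exponent_range(7, 96)` reproduces the −101 through 90 of the decimal example, and on a platform with IEEE 754 types `largest_binary(24, 127)` and `smallest_normal_binary(127)` match `FLT_MAX` and `FLT_MIN`.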

Zero values are finite values with significand 0. These are signed zeros, the sign bit specifies if a zero is +0 (positive zero) or −0 (negative zero).
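The special values are easy to get wrong in comparisons, so here is a brief sketch of how signed zeros and NaNs behave in C. The helper names are hypothetical; `signbit`, `isnan`, `NAN` and `INFINITY` are standard `<math.h>` facilities. It assumes the compiler follows IEEE 754 comparison semantics (options such as `-ffast-math` may break the `x != x` test).

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* +0 and -0 compare equal, so the sign has to be read with signbit(). */
bool is_negative_zero(double x)
{
    return x == 0.0 && signbit(x);
}

/* A NaN is the only value that compares unequal to itself. */
bool is_nan_by_compare(double x)
{
    return x != x;
}

/* Small self-check of the properties described above. */
void check_special_values(void)
{
    double pz = 0.0, nz = -0.0, nan_value = NAN, inf = INFINITY;

    assert(pz == nz);                       /* the two zeros are equal */
    assert(!signbit(pz) && signbit(nz));    /* but carry different signs */
    assert(is_nan_by_compare(nan_value));
    assert(isnan(nan_value));
    assert(!is_nan_by_compare(inf));        /* infinity equals itself */
    assert(-inf < inf);
}
```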

See the reference page for (many) more details...

