IEEE 754 Floating point math operations
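So, as I understand it, floating-point math is derived from grade-school math. For multiplication and division you can add or subtract the exponents after the computation. While working through my code design (on paper), I ran into a couple of issues, listed below:
Addition and/or subtraction...
- What do you do if the base and exponent are not the same?
- What if the exponent difference is greater than the size of the double-integer data type?
I have found material online, but none of it really says how this is handled. Now, from grade-school math, you must first normalize the values before you can run any calculations on them.
So...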
2^3 + 3^2 = 8 + 9 = 17
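Would the same thing be required here?
Edit: I apologize to the community, as I thought this question was very specific. This is using powers of 2, since the current platform is IA32. I was not aware of any platform that uses decimal floating point. I only used decimal as an example.
Marc B answered the first question: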
Thankfully, floats are all done with powers of 2, so just normalize the exponent. e.g. using powers of 10 scientific notation.
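So now the second question (as listed above): what do you do when, in order to normalize the values, you have to shift by more than there is space in the data type? In other words, if I have, say, 32 bits of precision and I have to shift by, say, 35 bits to make the exponents match, how do you handle that? How does the FPU handle it?
Thankfully, floats are all done with powers of 2, so just normalize the exponent. For example, using powers-of-10 scientific notation: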
     3.1e5            0.031e7
  + 2.96e7   ->    +  2.96 e7
  --------         ----------
                      2.991e7
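The same alignment step can be sketched in binary with integer significands. This is only an illustration under stated assumptions: `add_aligned` is a hypothetical helper, it expects both significands to be normalized p-bit integers, it handles positive values only, and it truncates the bits that are shifted out, whereas real hardware keeps a few extra bits and rounds.

```python
def add_aligned(c1, q1, c2, q2, p=32):
    """Add two positive values c * 2**q whose significands are p-bit integers.

    Hypothetical helper: assumes 2**(p-1) <= c < 2**p for both operands and
    truncates instead of rounding.
    """
    # Put the operand with the larger exponent first.
    if q1 < q2:
        c1, q1, c2, q2 = c2, q2, c1, q1
    # Align: shift the smaller operand right by the exponent difference.
    # When the difference is >= p, every bit is shifted out and c2 becomes 0.
    c2 >>= q1 - q2
    total, q = c1 + c2, q1
    # The sum may need one extra bit; renormalize back to p bits.
    if total.bit_length() > p:
        total >>= 1
        q += 1
    return total, q


big = (1 << 31, 0)       # 2**31
small = (1 << 31, -35)   # 2**-4, needs a 35-bit shift to align
print(add_aligned(*big, *small) == big)   # the small operand vanishes
print(1.0 + 2.0 ** -60 == 1.0)            # same effect with 64-bit doubles
```

Both lines print `True`: once the required shift exceeds the significand width, the smaller operand contributes nothing and the result is just the larger operand. The second check assumes Python's `float` is an IEEE 754 binary64 value, which holds on common platforms.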
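IEEE 754 floats are just one implementation of floating point. As usual, Wikipedia has decent references.
You pick a base (usually 2, but IEEE 754 also defines base 10), and a real number is then represented as f = sign * significand * base^exponent, where significand and exponent are both integers and sign is 1 or −1. Specifically, you have: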
Finite numbers, which may be either base 2 (binary) or base 10 (decimal). Each finite number is described by three integers: s = a sign (zero or one), c = a significand (or 'coefficient'), q = an exponent. The numerical value of a finite number is
(−1)^s × c × b^q
where b is the base (2 or 10), also called radix. For example, if the base is 10, the sign is 1 (indicating negative), the significand is 12345, and the exponent is −3, then the value of the number is (−1)^1 × 12345 × 10^−3 = −1 × 12345 × .001 = −12.345.
Two infinities: +∞ and −∞.
Two kinds of NaN: a quiet NaN (qNaN) and a signaling NaN (sNaN). A NaN may carry a payload that is intended for diagnostic information indicating the source of the NaN. The sign of a NaN has no meaning, but it may be predictable in some circumstances.
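As a quick check of the value formula for finite numbers, here is a minimal sketch. `finite_value` is a hypothetical helper name, and it uses Python's `fractions.Fraction` so the result is exact rather than rounded. The last lines poke at infinities and NaN through Python's `float`, assuming it is an IEEE 754 binary format.

```python
import math
from fractions import Fraction


def finite_value(s, c, q, b=10):
    """Return (-1)**s * c * b**q exactly (hypothetical helper)."""
    return (-1) ** s * c * Fraction(b) ** q


# The example from the text: sign 1, significand 12345, exponent -3, base 10.
print(finite_value(1, 12345, -3) == Fraction("-12.345"))  # True

# Infinities compare beyond every finite value; a NaN is unequal even to itself.
inf, nan = float("inf"), float("nan")
print(inf > 1e308, -inf < -1e308)     # True True
print(nan != nan, math.isnan(nan))    # True True
```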
The possible finite values that can be represented in a format are determined by the base b, the number of digits in the significand (precision p), and the exponent parameter emax:
c must be an integer in the range zero through b^p − 1 (e.g., if b=10 and p=7 then c is 0 through 9999999)
q must be an integer such that 1−emax ≤ q+p−1 ≤ emax (e.g., if p=7 and emax=96 then q is −101 through 90).
Hence (for the example parameters) the smallest non-zero positive number that can be represented is 1×10^−101 and the largest is 9999999×10^90 (9.999999×10^96), and the full range of numbers is −9.999999×10^96 through 9.999999×10^96. The numbers −b^(1−emax) and b^(1−emax) (here, −1×10^−95 and 1×10^−95) are the smallest (in magnitude) normal numbers; non-zero numbers between these smallest numbers are called subnormal numbers.
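The limits quoted for the example parameters follow directly from the two constraints on c and q. A small sketch that recomputes them, where `format_limits` and its dictionary keys are names made up for this illustration:

```python
from fractions import Fraction


def format_limits(b, p, emax):
    """Derive the limits of a format from base b, precision p and emax."""
    c_max = b ** p - 1        # largest significand
    q_min = 2 - emax - p      # from 1 - emax <= q + p - 1
    q_max = emax - p + 1      # from q + p - 1 <= emax
    return {
        "c_max": c_max,
        "q_min": q_min,
        "q_max": q_max,
        "smallest_positive": Fraction(b) ** q_min,      # c = 1 at q_min
        "largest": c_max * Fraction(b) ** q_max,        # c_max at q_max
        "smallest_normal": Fraction(b) ** (1 - emax),
    }


limits = format_limits(b=10, p=7, emax=96)
print(limits["c_max"], limits["q_min"], limits["q_max"])  # 9999999 -101 90
```

Everything strictly between zero and `smallest_normal` in magnitude is a subnormal number in this model.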
Zero values are finite values with significand 0. These are signed zeros, the sign bit specifies if a zero is +0 (positive zero) or −0 (negative zero).
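Signed zeros are easy to observe from a high-level language. A minimal sketch, assuming Python's `float` is an IEEE 754 binary format; `is_negative_zero` is a hypothetical helper name:

```python
import math


def is_negative_zero(x):
    """True only for -0.0: equal to zero, but with the sign bit set."""
    return x == 0.0 and math.copysign(1.0, x) < 0


print(0.0 == -0.0)              # True: the two zeros compare equal
print(is_negative_zero(-0.0))   # True
print(is_negative_zero(0.0))    # False
```

The comparison cannot tell the zeros apart, so the sketch copies the sign bit onto 1.0 with `math.copysign` and inspects that instead.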
See the referenced page for (many) more details...