Performing floating point addition algorithmically

Question

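I am trying to understand the algorithm for floating point addition. In the past I only had to do this on paper, and I did it by converting to decimal and back. I am writing a floating point ALU in an HDL, so that approach won't work here. I have read a lot of questions on the topic, the most useful being the one whose example I use below, as well as a lot of articles, but some concepts still elude me. I have written the questions in context below, but to summarize, here they are at the top:

When is the implicit bit in the mantissa 0, and when is it 1?
After the addition, how do we algorithmically check for normalization, and then determine which way to shift?
If one of the numbers is negative, is the subtraction of the mantissa performed in 2's complement or not?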

Borrowing the example:

00001000111100110110010010011100 (1.46487e-33)
00000000000011000111111010000100 (1.14741e-39)

First split them into their components (sign, exponent, mantissa):

0 00010001 11100110110010010011100
0 00000000 00011000111111010000100

Next, make their implicit integer bits explicit:

0 00010001 1.11100110110010010011100
0 00000000 0.00011000111111010000100

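Question 1: Is the reason for the zero integer in front of the 2nd value because the exponent is zero?

Next subtract the smaller exponent from the larger one and shift the mantissa of the smaller number right by that amount: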

  00010001
- 00000000
___________
00010001 = 17

0.00000000000000000000110

Add the mantissas:

   0.00000000000000000000110
+  1.11100110110010010011100
______________________________
   1.11100110110010010100010

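Question 2: In this case the MSB is 1, so the value is normalized and we can drop it. Suppose that it weren't. If the MSB was 0, would that still be considered a normalized value, or would we shift left to get a 1 in that place?

Question 3: Suppose one of the numbers was negative. Is the subtraction performed in 2's complement, or is it enough to simply subtract the mantissas as they are?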

Answer 1

When is the implicit bit in the mantissa 0, and when is it 1?

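When the (biased) exponent is at its minimum value (e.g. 0), the implicit bit is 0.
When the (biased) exponent is at its maximum value, there is no implicit bit. The value is infinity or NaN.
Otherwise the implicit bit is 1.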
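
The three cases above can be sketched as a small unpacking routine for binary32. This is only an illustration, not the asker's HDL: the `Unpacked` struct and `unpack` function are names made up here. One detail worth stating explicitly is that a value with exponent field 0 is scaled as if the exponent field were 1, so the sketch stores an "effective" exponent of 1 in that case. With that convention the alignment shift in the question's example comes out as 17 - 1 = 16 rather than 17.

```cpp
#include <cstdint>

// Hypothetical helper type: the fields of a binary32 value with the
// implicit bit made explicit.
struct Unpacked {
    bool     sign;     // true if the sign bit is set
    int      exp;      // effective biased exponent (1 for zero/subnormal inputs)
    uint32_t sig;      // 24-bit significand; bit 23 is the integer bit
    bool     special;  // true for infinity or NaN (exponent field all ones)
};

Unpacked unpack(uint32_t bits) {
    Unpacked u;
    const uint32_t expField = (bits >> 23) & 0xFFu;
    const uint32_t fraction = bits & 0x7FFFFFu;
    u.sign    = (bits >> 31) != 0;
    u.special = (expField == 0xFFu);
    if (expField == 0) {
        // Zero or subnormal: the implicit bit is 0, and the value is
        // scaled as if the exponent field were 1.
        u.exp = 1;
        u.sig = fraction;
    } else if (u.special) {
        // Infinity or NaN: there is no implicit bit; keep the raw fraction.
        u.exp = static_cast<int>(expField);
        u.sig = fraction;
    } else {
        // Normal number: the implicit bit is 1.
        u.exp = static_cast<int>(expField);
        u.sig = fraction | 0x800000u;
    }
    return u;
}
```

For the two operands in the question, `unpack(0x08F3649C)` yields exponent 17 with the integer bit set, and `unpack(0x000C7E84)` yields effective exponent 1 with the integer bit clear.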

After the addition, how do we algorithmically check for normalization, and then determine which way to shift?

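Addition (2 operands of the same sign): if there is a carry out of the most significant bit of the sum, shift right and add 1 to the exponent. Check for exponent overflow.

Addition (2 operands of opposite sign): this is really a subtraction. If all significand bits are zero, return zero. Otherwise, if the most significant bit is zero, shift left repeatedly as needed, decrementing the exponent, but do not decrement the exponent below its minimum value.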
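
A minimal sketch of those two normalization cases, assuming binary32 widths and that the magnitude add or subtract has already been done on aligned 24-bit significands. The `Norm` struct and `normalize` function are hypothetical names, rounding is omitted (the right shift simply truncates), and the exponent is the "effective" biased exponent whose minimum is 1.

```cpp
#include <cstdint>

// Hypothetical result type for this sketch.
struct Norm {
    int      exp;       // effective biased exponent, minimum 1
    uint32_t sig;       // 24-bit significand; bit 23 is the integer bit
    bool     overflow;  // true if the exponent reached the maximum (infinity)
};

// sig: raw result of adding or subtracting two aligned 24-bit significands
//      (fits in 25 bits).
// exp: effective biased exponent the operands were aligned to (>= 1).
Norm normalize(uint32_t sig, int exp) {
    if (sig == 0) {
        return Norm{1, 0, false};            // exact zero
    }
    Norm r{exp, sig, false};
    if (r.sig & 0x1000000u) {
        // Carry out of the integer bit: shift right, bump the exponent.
        r.sig >>= 1;                         // truncates; no rounding here
        r.exp += 1;
        if (r.exp >= 255) r.overflow = true; // exponent overflow
    } else {
        // Integer bit clear: shift left, but never go below the minimum
        // exponent. Stopping early leaves a subnormal result.
        while (!(r.sig & 0x800000u) && r.exp > 1) {
            r.sig <<= 1;
            r.exp -= 1;
        }
    }
    return r;
}
```

If the loop stops at exponent 1 with bit 23 still clear, the result is subnormal and would be encoded with an exponent field of 0.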

If one of the numbers is negative, is the subtraction of the mantissa performed in 2's complement or not?

No. Common FP encodings are sign-magnitude.

Question 1: Is the reason for the zero integer in front of the 2nd value because the exponent is zero

Yes, the biased exponent is at its minimum.

Question 2: In this case the MSB is 1 so the value is normalized and we can drop it. Suppose that it weren't. If the MSB was 0 would that still be considered a normalized value or would we shift left to get a 1 in that place?

The MSBit can be zero when the signs differ (or when both operands are 0.0). If the sum is not zero, shift left as described above.

Question 3: Suppose one of the numbers was negative, is subtraction performed in 2's complement, or is it enough to simply subtract the mantissas as they are?

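2's complement is not used. When the signs are the same, add the magnitudes. When the signs differ, flip the 2nd sign bit and call your subtract code.

IEEE-754 does not use the term "mantissa" but rather "significand", per the wiki. I believe the spec itself also says significand. Will comment later.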
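
The add/subtract dispatch described here might look like the following sketch, which works purely on sign and magnitude. `SignMag`, `addSM` and `subSM` are hypothetical names, and the magnitudes are assumed to be already aligned to a common exponent. Returning +0 when the magnitudes cancel is an assumption that matches the default round-to-nearest behaviour.

```cpp
#include <cstdint>

// Hypothetical type: a sign plus an unsigned, already aligned magnitude.
struct SignMag {
    bool     sign;  // true if negative
    uint32_t mag;   // aligned significand
};

SignMag subSM(SignMag a, SignMag b);

// a + b
SignMag addSM(SignMag a, SignMag b) {
    if (a.sign == b.sign) {
        return SignMag{a.sign, a.mag + b.mag};  // same sign: add magnitudes
    }
    b.sign = !b.sign;                           // a + (-b) == a - b
    return subSM(a, b);
}

// a - b
SignMag subSM(SignMag a, SignMag b) {
    if (a.sign != b.sign) {
        b.sign = !b.sign;                       // a - (-b) == a + b
        return addSM(a, b);
    }
    // Same sign: subtract the smaller magnitude from the larger one.
    if (a.mag >= b.mag) {
        const uint32_t d = a.mag - b.mag;
        return SignMag{d != 0 && a.sign, d};    // exact cancellation gives +0
    }
    return SignMag{!a.sign, b.mag - a.mag};     // result takes the opposite sign
}
```

Only unsigned additions and subtractions are performed; the sign of the result is decided by comparing magnitudes rather than by a 2's complement conversion.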

Answer 2

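Answer to question 1

See small C++/VCL example of dissecting the 32 and 64 bit floats for how to handle the normalized/denormalized and zero/inf/NaN states of floating point numbers. The states are defined as combinations of exponent and mantissa values.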

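Answer to question 2

No, you do not shift the operands so that a 1 ends up in front of the binary point. Instead you shift by the exponent difference between the operands. Also, ALU operations on mantissas are usually done on a wider mantissa bit-width to lower rounding errors; only the result is truncated to the original mantissa bit-width after normalization.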
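
As a rough illustration of aligning by the exponent difference on a wider datapath, the sketch below widens each 24-bit significand by a few low-order bits before shifting. The names `GUARD_BITS`, `shiftRightSticky`, `Aligned` and `align` are all made up for this example, and the choice of three extra bits plus OR-ing the shifted-out bits into the lowest bit (a "sticky" bit) is an assumption about one common arrangement, not something stated in the answer.

```cpp
#include <cstdint>

// Assumption: three extra low-order bits are kept during the addition.
const int GUARD_BITS = 3;

// Shift right by n, OR-ing any bits shifted out into the lowest bit so the
// later rounding step can still tell that something nonzero was lost.
uint64_t shiftRightSticky(uint64_t v, int n) {
    if (n <= 0) return v;
    if (n >= 64) return v != 0 ? 1u : 0u;   // avoid an out-of-range shift
    const uint64_t lost = v & ((uint64_t(1) << n) - 1);
    return (v >> n) | (lost != 0 ? 1u : 0u);
}

// Hypothetical result type: both significands on a common exponent.
struct Aligned {
    uint64_t a;
    uint64_t b;
    int      exp;
};

// sigA, sigB: 24-bit significands with the integer bit explicit.
// expA, expB: effective biased exponents (1 for subnormal inputs).
Aligned align(uint32_t sigA, int expA, uint32_t sigB, int expB) {
    const uint64_t a = uint64_t(sigA) << GUARD_BITS;
    const uint64_t b = uint64_t(sigB) << GUARD_BITS;
    const int diff = expA - expB;
    if (diff >= 0) {
        // b has the smaller exponent: shift it right to match a.
        return Aligned{a, shiftRightSticky(b, diff), expA};
    }
    // a has the smaller exponent: shift it right to match b.
    return Aligned{shiftRightSticky(a, -diff), b, expB};
}
```

After the add or subtract, the result would be normalized and then rounded or truncated back to 24 bits by dropping the extra low-order bits.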

Answer to question 3

Yes, you can also use 2's complement, so for c=a+b you can do something like this in C++ (using the already dissected operand parts):

// dissect a,b into their components and add the implicit 1 to the mantissas if needed
// a = (-1)^a.sig * a.man * 2^a.exp;
// b = (-1)^b.sig * b.man * 2^b.exp;
// (man is assumed to be a signed integer type wider than the mantissa)

// here you should handle special cases when operands are (+/-)inf or nan

if (a.sig) a.man=-a.man; // convert mantissas to 2's complement
if (b.sig) b.man=-b.man;

sh=a.exp-b.exp; // exponent difference
if (sh>=0) // shift the operand with the smaller exponent right to align the binary points
   {
   b.man>>=sh; // a shift count >= the bit-width of man must be handled separately
   c.exp=a.exp;
   }
else
   {
   a.man>>=-sh;
   c.exp=b.exp;
   }
c.man=a.man+b.man; // 2's complement addition
c.sig=0;
if (c.man<0){ c.sig=1; c.man=-c.man; } // convert back to unsigned mantissa

// here you should normalize c.exp,c.man and remove the implicit 1 from the mantissa
// and reconstruct the result float
// c = (-1)^c.sig * c.man * 2^c.exp;

You can also do this on an unsigned ALU, but then you need to sort the operands by sign and absolute value, which takes more work...
