Accuracy of Adding Floats vs. Multiplying Float by Integer

Question

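In my computer science course, we are studying floating-point numbers and how they are represented in memory. I already understand how they are represented in memory (the mantissa/significand, the exponent and its bias, and the sign bit), and I understand how floats are added to and subtracted from each other (denormalization and all of that fun stuff). However, while looking over some study questions, I noticed something that I cannot explain.

When a float that cannot be precisely represented is added to itself several times, the answer is lower than we would mathematically expect, but when that same float is multiplied by an integer, the answer comes out precisely to the correct number.

Here is an example from our study questions (the example is written in Java, and I have edited it down for simplicity):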

float max = 10.0f; /* Defined outside the function in the original code */
float min = 1.0f; /* Defined outside the function in the original code */
int count = 10; /* Passed to the function in the original code */
float width = (max - min) / count;
float p = min + (width * count);

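In this example, we are told that the result comes out to exactly 10.0. However, if we look at this problem as a sum of floats, we get a slightly different result: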

float max = 10.0f; /* Defined outside the function in the original code */
float min = 1.0f; /* Defined outside the function in the original code */
int count = 10; /* Passed to the function in the original code */
float width = (max - min) / count;

for(float p=min; p <= max; p += width){
    System.out.printf("%f%n", p);
}

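We are told that the final value of p in this test is ~9.999999, with a difference of -9.536743E-7 between the last value of p and the value of max. From a logical standpoint (knowing how floats work), this value makes sense.

The thing that I do not understand, though, is why we get exactly 10.0 for the first example. Mathematically, it makes sense that we would get 10.0, but knowing how floats are stored in memory, it does not make sense to me. Could anyone explain why we get a precise and exact value by multiplying an imprecise float with an int?

EDIT: To clarify, in the original study question some of the values are passed to the function and others are declared outside of the function. My example code is a shortened and simplified version of the study question example. Because some of the values are passed into the function rather than being explicitly defined as constants, I believe compile-time simplification/optimization can be ruled out.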

Answer 1

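Because 1.0 + ((10.0 - 1.0) / 10.0) * 10.0 performs only one calculation on the inexact value, and therefore picks up only one rounding error from it, it is more accurate than doing 10 additions of the float representation of 0.9f. I think that is the principle this example is intended to teach.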
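A minimal sketch of the two computations side by side, reusing the values from the question. The class and method names are made up for illustration, and the only assumption is ordinary Java `float` arithmetic with the default rounding.

```java
// Hypothetical demo class; the names are illustrative, the values come from the question.
class SumVersusProduct {
    // One multiplication and one addition after computing the width.
    static float viaMultiply(float min, float max, int count) {
        float width = (max - min) / count;
        return min + (width * count);
    }

    // Repeated addition: every += rounds the running total again.
    static float viaLoop(float min, float max, int count) {
        float width = (max - min) / count;
        float p = min;
        for (int i = 0; i < count; i++) {
            p += width;
        }
        return p;
    }

    public static void main(String[] args) {
        float a = viaMultiply(1.0f, 10.0f, 10);
        float b = viaLoop(1.0f, 10.0f, 10);
        System.out.println(a == 10.0f);                   // true
        System.out.println(10.0f - b == Math.ulp(10.0f)); // true: the loop ends one ulp below 10
    }
}
```

With these inputs the loop result is short of 10.0f by exactly one unit in the last place, which is the -9.536743E-7 difference mentioned in the question.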

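The key problem is that 0.1 cannot be represented exactly in floating point. So 0.9 has an error in it, and those errors add up in the function's loop.

The "exact" number is probably displayed because of a clever output formatting routine. When I first used computers, they loved to output these numbers in a ridiculous fixed-digit scientific format that was not human friendly.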

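To understand what is going on, I would look up Koenig's Dr Dobbs blog posts on the subject. They are an enlightening read, and the series concludes by showing that languages like Perl, Python and probably Java will make calculations appear accurate if the calculation is sufficiently precise.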

Koenig's Dr Dobbs article on floating point

Even Simple Floating-Point Output Is Complicated
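As a rough illustration of how output formatting can hide representation error, the sketch below prints the question's width value twice: once rounded to six decimals and once as the exact decimal expansion of the stored binary value. The class and method names are hypothetical. Whether formatting is what makes a particular result look exact can be checked separately with a direct `==` comparison.

```java
import java.math.BigDecimal;
import java.util.Locale;

// Hypothetical demo class; names are illustrative only.
class FormattingHidesError {
    // Exact decimal expansion of the binary value stored in the float.
    static String exact(float x) {
        return new BigDecimal(x).toString();
    }

    // Six-decimal rendering, the same conversion printf("%f") applies.
    static String sixDecimals(float x) {
        return String.format(Locale.ROOT, "%f", x);
    }

    public static void main(String[] args) {
        float width = (10.0f - 1.0f) / 10;
        System.out.println(sixDecimals(width)); // 0.900000
        System.out.println(exact(width));       // 0.89999997615814208984375
    }
}
```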

Do not be too surprised if fixed-point arithmetic gets added to CPUs in 5-10 years' time; financial people like their sums to be accurate.

Answer 2

First, some nitpicking:

When a float that cannot be precisely represented

There is no "float that cannot be precisely represented." Every float can be represented precisely as a float.

is added to itself several times, the answer is lower than we would mathematically expect,

When you add a number to itself several times, you can actually get an answer that is higher than you expect. I will use C99 hexfloat notation. Consider f = 0x1.000006p+0f. Then f+f = 0x1.000006p+1f, f+f+f = 0x1.800008p+1f, f+f+f+f = 0x1.000006p+2f, f+f+f+f+f = 0x1.400008p+2f, f+f+f+f+f+f = 0x1.80000ap+2f and f+f+f+f+f+f+f = 0x1.c0000cp+2f. However, 7.0*f = 0x1.c0000a8p+2, which rounds to 0x1.c0000ap+2f, and that is less than f+f+f+f+f+f+f.
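Java accepts the same hexadecimal floating-point literals, so the sequence above can be reproduced directly. This is only a sketch with a made-up class and helper name, not code from the study question.

```java
// Hypothetical demo class; the helper name is illustrative.
class RepeatedAddHigher {
    // Adds f to itself n times in float arithmetic, rounding after each step.
    static float addRepeatedly(float f, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            sum += f;
        }
        return sum;
    }

    public static void main(String[] args) {
        float f = 0x1.000006p+0f;
        float sum = addRepeatedly(f, 7);
        float product = 7.0f * f;
        System.out.println(sum == 0x1.c0000cp+2f);     // true
        System.out.println(product == 0x1.c0000ap+2f); // true
        System.out.println(sum > product);             // true: repeated addition overshoots here
    }
}
```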

but when that same float is multiplied by an integer, the answer, comes out precisely to the correct number.

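7 * 0x1.000006p+0f cannot be represented as an IEEE float, so it is rounded. With the default rounding mode of round-to-nearest-with-ties-going-to-even, a single arithmetic operation like this gives you the floating-point number closest to the exact result.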
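One way to check the "closest float" claim for this particular product is to compute the exact value with `BigDecimal` and compare the float result against its two neighbours. The class and method names below are hypothetical, and the sketch only tests this one operation rather than proving anything in general.

```java
import java.math.BigDecimal;

// Hypothetical demo class; names are illustrative only.
class SingleOperationRounding {
    // Absolute distance between a float and an exact reference value.
    static BigDecimal distance(float candidate, BigDecimal exact) {
        return new BigDecimal(candidate).subtract(exact).abs();
    }

    // True if the float product k * f is at least as close to the exact
    // product as either neighbouring float.
    static boolean productIsNearest(float f, int k) {
        BigDecimal exact = new BigDecimal(f).multiply(BigDecimal.valueOf(k));
        float rounded = k * f;
        BigDecimal d = distance(rounded, exact);
        return d.compareTo(distance(Math.nextUp(rounded), exact)) <= 0
            && d.compareTo(distance(Math.nextDown(rounded), exact)) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(productIsNearest(0x1.000006p+0f, 7)); // true
    }
}
```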

The thing that I do not understand, though, is why we get exactly 10.0 for the first example. Mathematically, it makes sense that we would get 10.0, but knowing how floats are stored in memory, it does not make sense to me. Could anyone explain why we get a precise and exact value by multiplying an imprecise float with an int?

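To answer your question: you get different results because you performed different operations. It is a bit of a fluke that you got the "right" answer here.

Let's change the numbers. If I compute 0x1.800002p+0f / 3, I get 0x1.00000155555...p-1, which rounds to 0x1.000002p-1f. When I triple that, I get 0x1.800003p+0f, which rounds (since we break ties to even) to 0x1.800004p+0f. This is the same result I would get if I computed f+f+f in float arithmetic, where f = 0x1.000002p-1f. Either way, the result is not the 0x1.800002p+0f I started with.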
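The same counterexample as a small Java sketch (the class and method names are invented for illustration). It divides by 3 and multiplies back, and here the round trip does not return the starting value.

```java
// Hypothetical demo class; names are illustrative only.
class DivideThenTriple {
    // Divides x by n, then multiplies the rounded quotient by n again.
    static float roundTrip(float x, int n) {
        float part = x / n;
        return part * n;
    }

    public static void main(String[] args) {
        float x = 0x1.800002p+0f;
        float third = x / 3;
        float tripled = roundTrip(x, 3);
        float summed = third + third + third;
        System.out.println(third == 0x1.000002p-1f);   // true
        System.out.println(tripled == 0x1.800004p+0f); // true
        System.out.println(summed == tripled);         // true
        System.out.println(tripled == x);              // false: the round trip misses x
    }
}
```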


floating-point

precision

floating-accuracy