Nonintuitive result of the assignment of a double precision number to an int variable in C

Question

Can anyone explain why I get two different numbers, 14 and 15 respectively, as output from the following code?

#include <stdio.h>  

int main()
{
    double Vmax = 2.9; 
    double Vmin = 1.4; 
    double step = 0.1; 

    double a =(Vmax-Vmin)/step;
    int b = (Vmax-Vmin)/step;
    int c = a;

    printf("%d  %d",b,c);  // 14 15, why?
    return 0;
}

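I expected to get 15 in both cases, but it seems I am missing some fundamentals of the language.

I am not sure if it is relevant, but I ran the test in CodeBlocks. However, if I type the same lines of code into some online compiler (this one for example), I get 15 for both printed variables.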

Answer 1

... why I get two different numbers ...

Apart from the usual floating-point issues, b and c are arrived at by different computation paths. c is calculated by first saving the value as double a.

double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;

C allows intermediate floating-point math to be computed using wider types. Check the value of FLT_EVAL_METHOD from <float.h>.

Except for assignment and cast (which remove all extra range and precision), ...

-1 indeterminable;

0 evaluate all operations and constants just to the range and precision of the type;

1 evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;

2 evaluate all operations and constants to the range and precision of the long double type.

C11dr §5.2.4.2.2 9

OP

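By saving the quotient in double a = (Vmax-Vmin)/step;, the precision is forced to double, whereas int b = (Vmax-Vmin)/step; could be computed as long double.

This subtle difference results from (Vmax-Vmin)/step (computed perhaps as long double) being saved as a double versus remaining a long double. One is 15 (or just above), and the other is just under 15. int truncation amplifies this difference to 15 and 14.

On another compiler, both results may be the same due to FLT_EVAL_METHOD < 2 or other floating-point characteristics.

Conversion from a floating-point number to int is severe for numbers near a whole number. It is often better to use round() or lround(). The best solution depends on the situation.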

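Answer 2

The "simple" answer is that those seemingly simple numbers 2.9, 1.4, and 0.1 are all represented internally as binary floating point, and in binary, the number 1/10 is represented as the infinitely repeating binary fraction 0.00011001100110011... (base 2). (This is similar to how 1/3 in decimal ends up being 0.333333333....) Converted back to decimal, those original numbers end up being things like 2.8999999999, 1.3999999999, and 0.0999999999. And when you do additional math on them, those .0999999999's tend to proliferate.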

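And then the additional problem is that the path by which you compute something (whether you store it in intermediate variables of a particular type, or compute it "all at once", meaning that the processor might use internal registers with greater precision than type double) can end up making a significant difference.

The bottom line is that when you convert a double back to an int, you almost always want to round, not truncate. What happened here was that (in effect) one computation path gave you 15.0000000001, which truncated down to 15, while the other gave you 14.999999999, which truncated all the way down to 14.

See also question 14.4a in the C FAQ list.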

Answer 3

An equivalent problem is analyzed in analysis of C programs for FLT_EVAL_METHOD==2.

If FLT_EVAL_METHOD==2:

double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;

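b is computed by evaluating a long double expression and then truncating it to an int, whereas c is obtained by evaluating the long double expression, converting it to double, and then truncating that to int.

So the two values are not obtained by the same process, and this may lead to different results because floating-point types do not provide the usual exact arithmetic.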

Answer 4

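This is an interesting question indeed. Here is what happens precisely in your hardware. This answer gives the exact calculations with the precision of IEEE double precision floats, i.e. 52 bits of mantissa plus one implicit bit. For details on the representation, see the wikipedia article.

OK, so you first define some variables: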

double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;

The respective values in binary will be

Vmax =    10.111001100110011001100110011001100110011001100110011
Vmin =    1.0110011001100110011001100110011001100110011001100110
step = .00011001100110011001100110011001100110011001100110011010

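If you count the bits, you will see that I have given the first bit that is set plus 52 bits to the right. This is exactly the precision at which your computer stores a double. Note that the value of step has been rounded up.

Now you do some math on these numbers. The first operation, the subtraction, gives an exact result: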

 10.111001100110011001100110011001100110011001100110011
- 1.0110011001100110011001100110011001100110011001100110
--------------------------------------------------------
  1.1000000000000000000000000000000000000000000000000000

Then you divide by step, which has been rounded up by your compiler:

   1.1000000000000000000000000000000000000000000000000000
 /  .00011001100110011001100110011001100110011001100110011010
--------------------------------------------------------
1110.1111111111111111111111111111111111111111111111111100001111111111111

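Due to the rounding of step, the result is a tad below 15. Unlike before, I have not rounded immediately, because that is precisely where the interesting stuff happens: your CPU can indeed store floating point numbers of greater precision than a double, so rounding does not take place immediately.

So, when you convert the result of (Vmax-Vmin)/step directly to an int, your CPU simply cuts off the bits after the fractional point (this is how the implicit double -> int conversion is defined by the language standards):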

               1110.1111111111111111111111111111111111111111111111111100001111111111111
cutoff to int: 1110

However, if you first store the result in a variable of type double, rounding takes place:

               1110.1111111111111111111111111111111111111111111111111100001111111111111
rounded:       1111.0000000000000000000000000000000000000000000000000
cutoff to int: 1111

And this is exactly the result you got.
