Precision of floating-point data types in C++

Question

Why doesn't the precision of floating-point data types grow in proportion to their size? For example:

#include <iostream>
#include <limits>

int main() {
    std::cout << sizeof(float) << "\n";  // gives 4 on my machine (64-bit Debian, gcc 6.3.0)
    std::cout << std::numeric_limits<float>::digits10 << "\n";  // gives 6

    std::cout << sizeof(double) << "\n";  // gives 8
    std::cout << std::numeric_limits<double>::digits10 << "\n";  // gives 15

    std::cout << sizeof(long double) << "\n";  // gives 16
    std::cout << std::numeric_limits<long double>::digits10 << "\n";  // gives 18
}

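As you can see, the precision of double is about twice the precision of float, which makes sense because the size of double is twice the size of float.

But that is not the case between double and long double: the size of long double is 128 bits, twice that of the 64-bit double, yet its precision is only three decimal digits more!

I have no idea how floating-point numbers are implemented, but from a rational point of view, does it even make sense to use 64 more bits of memory for just three more digits of precision?

I searched around but could not find a simple, straightforward answer. If someone can explain why the precision of long double is only three digits more than that of double, could you also explain why this differs from the situation between double and float?

And I also want to know how I can get better precision without defining my own data type, which is obviously going to come at the expense of performance.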

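Answer 1

"Precision" is not all there is to a floating-point value. It is also about "magnitude" (not sure if that term is correct, though!): how big (or small) can the represented values become?

For that, try printing the max_exponent of each type as well: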

#include <iostream>
#include <limits>

int main() {
    std::cout << "float: " << sizeof(float) << "\n";
    std::cout << std::numeric_limits<float>::digits << "\n";        // significand bits
    std::cout << std::numeric_limits<float>::max_exponent << "\n";  // largest binary exponent

    std::cout << "double: " << sizeof(double) << "\n";
    std::cout << std::numeric_limits<double>::digits << "\n";
    std::cout << std::numeric_limits<double>::max_exponent << "\n";

    std::cout << "long double: " << sizeof(long double) << "\n";
    std::cout << std::numeric_limits<long double>::digits << "\n";
    std::cout << std::numeric_limits<long double>::max_exponent << "\n";
}

Output on ideone:

float: 4
24
128
double: 8
53
1024
long double: 16
64
16384

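So the extra bits are not all used to represent more digits (precision); some of them allow the exponent to be larger. In IEEE 754 terms, long double mostly increases the exponent range rather than the precision.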

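The format shown by my ideone example above is (probably) the "x86 extended precision format", which assigns 1 bit to the integer part, 63 bits to the fraction part (64 bits in total) and 15 bits to the exponent (2^(15-1) = 16384; the stored exponent is biased, so half of its range covers negative exponents). Together with the sign bit, that makes 80 bits.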

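Note that the C++ standard only requires long double to be at least as precise as double, so long double could be a synonym for double, the x86 extended-precision format as shown here (most likely), or something better (AFAIK only GCC on PowerPC).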

And I also want to know how I can get better precision without defining my own data type, which is obviously going to come at the expense of performance.

You need to either write it yourself (a learning experience for sure, but best avoided in production code) or use a library, such as GNU MPFR or Boost.Multiprecision.

Answer 2

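The C++ standard sets no fixed requirements for floating-point types, other than some minimum levels they must meet.

The C++ implementation you are using probably targets Intel processors. Besides the common IEEE-754 basic 32-bit and 64-bit binary floating-point formats, Intel has an 80-bit format. Your C++ implementation is probably using that for long double.

Intel's 80-bit format has 11 more significand bits than the 64-bit double format. (It actually uses 64 where the double format uses 52, but one of them is reserved for an explicit leading 1.) Eleven more bits means 2¹¹=2048 times as many significand values, which is about three more decimal digits.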

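The 80-bit format (ten bytes) is preferentially aligned to a multiple of 16 bytes, so six bytes of padding are included to make the size of long double a multiple of 16 bytes.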

Answer 3

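Your question rests on a lot of incorrect assumptions

First, there is no requirement on the sizes of types in C++. The standard only mandates the minimum precision of each type and that...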

... The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf

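Most modern implementations map float and double to the IEEE-754 single- and double-precision formats, because hardware support for them is mainstream. However, long double does not have such wide support, because few people need higher precision than double and hardware for it costs a lot more. Therefore some platforms map it to IEEE-754 double precision, i.e. the same as double. Others map it to the 80-bit IEEE 754 extended-precision format if the underlying hardware supports it. Otherwise long double will be represented by double-double arithmetic or IEEE-754 quadruple precision.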

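Moreover, precision does not scale linearly with the number of bits in the type either. It is easy to see that double is more than twice as precise as float and has 8 times the exponent range of float despite using only twice the storage, because it has 53 bits of significand compared to 24 in float, plus 3 more exponent bits. Types can also have trap representations or padding bits, so different types may have different ranges even when they have the same size and belong to the same category (integral or floating-point).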

So the important thing here is std::numeric_limits<long double>::digits. If you print that you'll see that long double has 64 bits of significand, which is just 11 bits more than double. That means your compiler uses the 80-bit extended-precision format for long double; the rest is just padding bytes to keep the alignment. In fact gcc has various options that will change your output:

-malign-double and -mno-align-double for controlling the alignment of long double
-m96bit-long-double and -m128bit-long-double for changing the padding size
-mlong-double-64, -mlong-double-80 and -mlong-double-128 for controlling the underlying long double implementation

By changing the options you'll get the following results for long double

-mlong-double-128: size = 16, digits10 = 33, digits2 = 113
-m96bit-long-double: size = 12, digits10 = 18, digits2 = 64
-mlong-double-64: size = 8, digits10 = 15, digits2 = 53

You'll get size = 10 if you disable padding, but that comes at a performance cost due to misalignment. See more demos on Compiler Explorer.

On PowerPC you can see the same phenomenon when changing the floating-point format. With -mabi=ibmlongdouble (double-double arithmetic, which is the default) you'll get (size, digits10, digits2) = (16, 31, 106), but with -mabi=ieeelongdouble the tuple becomes (16, 33, 113).

For more information you should read https://en.wikipedia.org/wiki/Long_double

And I also want to know how I can get better precision without defining my own data type

The keyword to search for is arbitrary-precision arithmetic. There are various libraries for that, which you can find in the List of arbitrary-precision arithmetic software. You can find more information under the tags bigint, biginteger or arbitrary-precision.

