Dividing uint64_t by numeric_limits<uint64_t>::max() to a floating point representation

Question

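Given a uint64_t value, is it possible to divide it by std::numeric_limits<uint64_t>::max() so that the result is a floating point representation of the value (0.0 to 1.0 representing 0 to 2^64-1)?

Numbers greater than max can be left to undefined behaviour, as long as every number less than or equal to max is correctly divided to its floating point "counterpart" (or to the closest number the floating point type can represent, when the true value is not representable).

I'm not sure casting one (or both) sides to long double will result in correct values for all valid inputs, because the standard doesn't guarantee long double to have a mantissa of 64 bits. Is this possible at all?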

Answer 1

I think the simplest strictly portable way is boost::multiprecision::cpp_bin_float_quad:

#include <boost/multiprecision/cpp_bin_float.hpp>

#include <limits>
#include <cstdint>
#include <iostream>
#include <iomanip>


int main()
{
    using Float = boost::multiprecision::cpp_bin_float_quad;

    for (std::uint64_t i = 0 ; i < 64 ; ++i)
    {
        auto v = std::uint64_t(1) << i;
        auto x = Float(v);

        x /= std::numeric_limits<std::uint64_t>::max();

        // demonstrate lossless round-trip
        auto y = x * std::numeric_limits<std::uint64_t>::max();

        std::cout << std::setprecision(std::numeric_limits<Float>::digits10)
        << (x * 100) << "% : "
        << std::hex << y.convert_to<std::uint64_t>()
        << std::endl;
    }
}

Expected results:

5.42101086242752217033113759205528e-18% : 1
1.08420217248550443406622751841106e-17% : 2
2.16840434497100886813245503682211e-17% : 4
4.33680868994201773626491007364422e-17% : 8
8.67361737988403547252982014728845e-17% : 10
1.73472347597680709450596402945769e-16% : 20
3.46944695195361418901192805891538e-16% : 40
6.93889390390722837802385611783076e-16% : 80
1.38777878078144567560477122356615e-15% : 100
2.7755575615628913512095424471323e-15% : 200
5.55111512312578270241908489426461e-15% : 400
1.11022302462515654048381697885292e-14% : 800
2.22044604925031308096763395770584e-14% : 1000
4.44089209850062616193526791541169e-14% : 2000
8.88178419700125232387053583082337e-14% : 4000
1.77635683940025046477410716616467e-13% : 8000
3.55271367880050092954821433232935e-13% : 10000
7.1054273576010018590964286646587e-13% : 20000
1.42108547152020037181928573293174e-12% : 40000
2.84217094304040074363857146586348e-12% : 80000
5.68434188608080148727714293172696e-12% : 100000
1.13686837721616029745542858634539e-11% : 200000
2.27373675443232059491085717269078e-11% : 400000
4.54747350886464118982171434538157e-11% : 800000
9.09494701772928237964342869076313e-11% : 1000000
1.81898940354585647592868573815263e-10% : 2000000
3.63797880709171295185737147630525e-10% : 4000000
7.27595761418342590371474295261051e-10% : 8000000
1.4551915228366851807429485905221e-09% : 10000000
2.9103830456733703614858971810442e-09% : 20000000
5.8207660913467407229717943620884e-09% : 40000000
1.16415321826934814459435887241768e-08% : 80000000
2.32830643653869628918871774483536e-08% : 100000000
4.65661287307739257837743548967072e-08% : 200000000
9.31322574615478515675487097934145e-08% : 400000000
1.86264514923095703135097419586829e-07% : 800000000
3.72529029846191406270194839173658e-07% : 1000000000
7.45058059692382812540389678347316e-07% : 2000000000
1.49011611938476562508077935669463e-06% : 4000000000
2.98023223876953125016155871338926e-06% : 8000000000
5.96046447753906250032311742677853e-06% : 10000000000
1.19209289550781250006462348535571e-05% : 20000000000
2.38418579101562500012924697071141e-05% : 40000000000
4.76837158203125000025849394142282e-05% : 80000000000
9.53674316406250000051698788284564e-05% : 100000000000
0.000190734863281250000010339757656913% : 200000000000
0.000381469726562500000020679515313826% : 400000000000
0.000762939453125000000041359030627651% : 800000000000
0.0015258789062500000000827180612553% : 1000000000000
0.00305175781250000000016543612251061% : 2000000000000
0.00610351562500000000033087224502121% : 4000000000000
0.0122070312500000000006617444900424% : 8000000000000
0.0244140625000000000013234889800848% : 10000000000000
0.0488281250000000000026469779601697% : 20000000000000
0.0976562500000000000052939559203394% : 40000000000000
0.195312500000000000010587911840679% : 80000000000000
0.390625000000000000021175823681358% : 100000000000000
0.781250000000000000042351647362715% : 200000000000000
1.56250000000000000008470329472543% : 400000000000000
3.12500000000000000016940658945086% : 800000000000000
6.25000000000000000033881317890172% : 1000000000000000
12.5000000000000000006776263578034% : 2000000000000000
25.0000000000000000013552527156069% : 4000000000000000
50.0000000000000000027105054312138% : 8000000000000000

You'll get better performance with boost::multiprecision::float128, but it only works with gcc (specifying -std=gnu++NN) or the Intel compilers.

Answer 2

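Multiprecision arithmetic is not needed. In floating point arithmetic that uses fewer than 64 bits for the significand (a.k.a. mantissa), division by n_max = std::numeric_limits<uint64_t>::max() can be computed in a fully rounded way (i.e. the computed result will be identical to the closest approximation of the exact arithmetic ratio in the target floating point format) as follows: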

n/n_max = n/(2⁶⁴-1) = n/2⁶⁴/(1-2^-64) = n/2⁶⁴*(1+2^-64+2^-128+...) = n/2⁶⁴ + whatever doesn't fit in the significand

So the result is

n/n_max = n/2⁶⁴

The following C++ test program implements both the naive and the accurate way of computing the ratio n/n_max:

#include <climits>
#include <cmath>
#include <iostream>
#include <limits>
#include <type_traits>


template<typename F, typename U>
F map_to_unit_range_naive(U n)
{
    static_assert(std::is_floating_point<F>::value, "Result type must be a floating point type");
    static_assert(std::is_unsigned<U>::value, "Input type must be an unsigned integer type");
    return F(n)/F(std::numeric_limits<U>::max());
}

template<typename F, typename U>
F map_to_unit_range_accurate(U n)
{
    static_assert(std::is_floating_point<F>::value, "Result type must be a floating point type");
    static_assert(std::is_unsigned<U>::value, "Input type must be an unsigned integer type");
    const int UBITS = sizeof(U) * CHAR_BIT;
    return std::ldexp(F(n), -UBITS);
}

template<class F, class U>
double error_mapping_to_unit_range(U n)
{
    const F r1 = map_to_unit_range_accurate<F>(n);
    const F r2 = map_to_unit_range_naive<F>(n);
    return (1-r2/r1);
}

#define CHECK_MAPPING_TO_UNIT_RANGE(n, result_type)                     \
    std::cout << "map_to_unit_range<" #result_type ">(" #n "): err="    \
              << error_mapping_to_unit_range<result_type>(n)*100 << "%" \
              << std::endl;

int main()
{
    CHECK_MAPPING_TO_UNIT_RANGE(123u,         float);
    CHECK_MAPPING_TO_UNIT_RANGE(123ul,        float);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890u,  float);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890ul, float);
    std::cout << "\n";
    CHECK_MAPPING_TO_UNIT_RANGE(123ul,        double);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890ul, double);
    return 0;
}

The program demonstrates that the naive method is on par with the carefully written code:

map_to_unit_range<float>(123u): err=0%
map_to_unit_range<float>(123ul): err=0%
map_to_unit_range<float>(1234567890u): err=0%
map_to_unit_range<float>(1234567890ul): err=0%

map_to_unit_range<double>(123ul): err=0%
map_to_unit_range<double>(1234567890ul): err=0%

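This may seem surprising at first, but it has a simple explanation: if the floating point type cannot exactly represent the integer value 2^N-1, the value is rounded to 2^N, so the division in the next step is effectively the exact division by 2^N (as per the formula above).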
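The rounding described above can be checked directly. The following is a minimal sketch, assuming a binary floating point type and round-to-nearest integer-to-float conversion (the usual IEEE 754 setup); the names max_rounds_to_power_of_two and check_rounding_examples are hypothetical helpers, not part of the answer's program.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// True when converting U's maximum value (2^N - 1) to F rounds up to 2^N.
// Assumption: binary floating point with round-to-nearest conversion.
template <typename F, typename U>
bool max_rounds_to_power_of_two()
{
    static_assert(std::numeric_limits<F>::radix == 2, "binary floating point assumed");
    const F converted = static_cast<F>(std::numeric_limits<U>::max());
    const F power     = std::ldexp(F(1), std::numeric_limits<U>::digits); // 2^N
    return converted == power;
}

// Hypothetical self-check covering the type combinations used in the test program.
void check_rounding_examples()
{
    // Fewer significand bits than integer bits: 2^N - 1 becomes 2^N.
    assert((max_rounds_to_power_of_two<float,  std::uint32_t>()));
    assert((max_rounds_to_power_of_two<float,  std::uint64_t>()));
    assert((max_rounds_to_power_of_two<double, std::uint64_t>()));
    // A 53-bit significand holds 2^32 - 1 exactly, so nothing is rounded.
    assert((!max_rounds_to_power_of_two<double, std::uint32_t>()));
}
```

Calling check_rounding_examples() from main should pass silently on such a platform; the last assertion is the case where the formula's premise does not hold.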

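Note that when the precision of the floating point type exceeds the size of the integer type (so that 2^N-1 is exactly representable), the premise of the formula is not met and the "accurate" method is no longer accurate: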

int main()
{
    CHECK_MAPPING_TO_UNIT_RANGE(123u,        double);
    CHECK_MAPPING_TO_UNIT_RANGE(1234567890u, double);
    return 0;
}

Output:

map_to_unit_range<double>(123u): err=-2.32831e-08%
map_to_unit_range<double>(1234567890u): err=-2.32831e-08%

Here the "error" comes from the "accurate" method.
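The size of that discrepancy can be reproduced in isolation. This is a small sketch, assuming an IEEE 754 double with a 53-bit significand; relative_difference_u32 is a hypothetical name that mirrors the error formula 1 - r2/r1 from the test program for a 32-bit input.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Relative difference 1 - naive/shifted for a 32-bit input evaluated in double.
// double represents 2^32 - 1 exactly, so the naive division uses the true
// divisor and the power-of-two scaling is the one that deviates.
double relative_difference_u32(std::uint32_t n)
{
    assert(n != 0); // the ratio below is undefined for n == 0
    const double naive   = static_cast<double>(n) / 4294967295.0;   // n / (2^32 - 1)
    const double shifted = std::ldexp(static_cast<double>(n), -32); // n / 2^32
    return 1.0 - naive / shifted;
}
```

For any non-zero n the result should be close to -2^-32, about -2.32831e-10, which is the -2.32831e-08% shown in the output above.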

Credits:

Many thanks to @interjay and @Jonathan Mee for their thorough peer review of previous versions of this answer.

Answer 3

I'd infer from this part of your question:

I'm not sure casting one (or both) sides to long double will result in correct values for all valid inputs, because the standard doesn't guarantee long double to have a mantissa of 64 bits. Is this possible at all?

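that you are asking:

Can any value representable by a `uint64_t` survive a round trip of being cast to the mantissa of a `long double` and back to a `uint64_t`?

The answer is implementation dependent. The key is how many bits the long double's mantissa has. Fortunately, C++11 gives you a way to find out: numeric_limits<long double>::digits. For example: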

const auto ui64max = numeric_limits<uint64_t>::max();
const auto foo = ui64max - 1;
const auto bar = static_cast<long double>(foo) / ui64max;

cout << "Max Digits For Roundtrip Guarantee: " << numeric_limits<long double>::digits
     << "\nMax Digits In uint64_t: " << numeric_limits<uint64_t>::digits
     << "\nConverting: " << foo
     << "\nTo long double Mantissa: " << bar
     << "\nRoundtrip Back To uint64_t: " << static_cast<uint64_t>(bar * ui64max)
     << endl;


You can verify this at compile time with something like:

static_assert(numeric_limits<long double>::digits >= numeric_limits<uint64_t>::digits, "long double has insufficient mantissa precision in this implementation");
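The same check can also be made at run time for individual values. The sketch below only converts to long double and back, without the division, and is not the answer's own code; survives_round_trip and check_round_trip_examples are hypothetical names. It guards against the case where the value rounds up to 2^64, because converting that back to uint64_t would be undefined behaviour.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// True when v is unchanged by a uint64_t -> long double -> uint64_t round trip.
bool survives_round_trip(std::uint64_t v)
{
    const long double f     = static_cast<long double>(v);
    const long double two64 = std::ldexp(1.0L, 64); // 2^64, exact as a power of two
    if (f >= two64)
        return false; // rounded up out of range; casting back would be undefined
    return static_cast<std::uint64_t>(f) == v;
}

// Hypothetical self-check: the result for the maximum value depends on
// whether long double has at least 64 significand bits.
void check_round_trip_examples()
{
    assert(survives_round_trip(0));
    assert(survives_round_trip(std::uint64_t(1) << 63)); // a power of two is always exact
    const bool wide = std::numeric_limits<long double>::digits
                      >= std::numeric_limits<std::uint64_t>::digits;
    assert(survives_round_trip(std::numeric_limits<std::uint64_t>::max()) == wide);
}
```

On an implementation where long double is the 80-bit x87 format the maximum value should survive; where long double is the same as double it should not.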

For more information on the math behind the round-trip question, see: Float Fractional Precision

