How could you divide an IEEE754 64-bit double by 1000 on a platform only supporting 32-bit float?

Question

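I have an electricity meter connected to a DCS (distributed control system) via PROFIBUS. The meter (Siemens Sentron PAC3200) provides its count as an IEEE 754 double in Wh (watt-hours). In addition, the counter overflows at 1.0e12 Wh, or 1,000 GWh. (Cut scene: some years ago, Siemens development lab. "Let's see, how to transfer a 40-bit unsigned integer value? Let's use double!")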

My goal is to log the count consistently with kWh precision.

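However, the DCS only supports single-precision floats. So if I took the direct route, i.e. squeezed the data into a float, errors would show up in the kWh reading once about seven decimal digits are exceeded, i.e. from about 100,000,000 Wh or 100 MWh at the latest. The current count is already 600 MWh, so this is not a feasible way.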
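
To make the precision problem concrete, here is a minimal C sketch for a host compiler, not DCS code. The helper names `kwh_via_float` and `kwh_exact` are made up for this illustration, and the sketch assumes that `float` is IEEE 754 binary32.

```c
#include <stdint.h>

/* Hypothetical helper: squeeze a Wh count into a binary32, then derive kWh.
   Assumes float is IEEE 754 binary32 (24-bit significand) and wh < 2^31,
   so the cast back to uint32_t stays in range. */
static uint32_t kwh_via_float(uint32_t wh)
{
    volatile float f = (float)wh;   /* volatile forces rounding to binary32 */
    return (uint32_t)f / 1000u;
}

/* Exact reference using integer arithmetic only. */
static uint32_t kwh_exact(uint32_t wh)
{
    return wh / 1000u;
}
```

At a count of 600,000,999 Wh, neighbouring binary32 values are 64 Wh apart, so the count is rounded to 600,001,024 Wh and the kWh reading comes out as 600,001 instead of 600,000.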

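So for now, I put the significand into an unsigned double integer (UDINT, 32 bits on this platform) and carry out the conversion according to IEEE 754, which yields the correct value in Wh. However, this entails an overflow at 2^32 Wh or about 4.3 GWh, which will last us less than ten years.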

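Since I only need kWh precision, I had the idea of dividing by 1000 early in the conversion. That would move the variable's overflow to 4,300 GWh, while the meter's internal counter already overflows at 1,000 GWh. Problem solved, in theory.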

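However, since IEEE 754 is a binary floating-point format, the only division that comes easily is by 1024 (right-shifting 10 times), which introduces a large error. Multiplying by a correction factor of 1.024 afterwards would only happen in single precision on this platform, nullifying the previous effort.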
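
The shift-and-correct idea can be sketched in C as follows. This is only an illustration on a host compiler: `kwh_shift_and_correct` is a hypothetical name, the 64-bit parameter merely holds the test input, and `float` is assumed to be IEEE 754 binary32.

```c
#include <stdint.h>

/* Hypothetical helper: divide by 1024 with a shift, then apply the
   correction factor 1.024 in binary32 arithmetic.
   Assumes wh < 2^40, so the shifted value fits into 32 bits. */
static uint32_t kwh_shift_and_correct(uint64_t wh)
{
    uint32_t shifted = (uint32_t)(wh >> 10);             /* wh / 1024, truncated */
    volatile float corrected = (float)shifted * 1.024f;  /* rounded to binary32  */
    return (uint32_t)corrected;
}
```

Near the top of the meter's range the corrected value is a binary32 number between 2^29 and 2^30, where representable values are 64 apart, so single kWh can no longer be resolved.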

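Another option would be to output a "high" and a "low" UDINT in Wh from the conversion; then I could, at least in theory, calculate back to kWh, but this looks awkward (and awful).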

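I have the subtle feeling that I might have overlooked something (single-person groupthink, so to speak); I'm open to any other ideas on how to obtain 1/1000 of the transferred double value.

Thanks and best regards

Björn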

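P.S.: For your viewing pleasure, here is the solution based on @EricPostpischil's answer, tailored to the specifics of the platform and the task. The language used is SCL (Structured Control Language) conforming to EN 61131-3, which is a Pascal dialect.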

FUNCTION_BLOCK PAC3200KON_P

VAR_INPUT
    INH : DWORD;
    INL : DWORD;
END_VAR

VAR_OUTPUT
    OUT : UDINT;
    SGN : BOOL;
END_VAR

VAR
    significand:              UDINT;
    exponent, i, shift:       INT;
    sign:                     BOOL;
    d0, d1, y0, y1, r1, temp: DWORD;
END_VAR
(*
    Convert the energy count delivered by Siemens Sentron PAC3200
    (IEEE 754 binary64 format, a.k.a. double) into an UDINT.

    Peculiarities:
    - This hardware platform only supports binary32 (a.k.a. float).

    - The Sentron's internal counter overflows at 1.0e12 Wh (1000 GWh).

    - kWh resolution suffices.

    - If you converted the double directly to UDINT and divided by 1000
      afterwards, the range would be limited to (2^32-1) Wh or about
      4.295 GWh.

    - This is why this function first divides the significand by 1000
      and then proceeds with conversion to UDINT. This expands the
      range to (2^32-1) kWh or about 4295 GWh, which isn't reachable in
      practice since the device's internal counter overflows before.

    Background:

    IEEE 754 binary64 bit assignment:

               High-Byte                         Low-Byte
    66665555555555444444444433333333 3322222222221111111111
    32109876543210987654321098765432 10987654321098765432109876543210
    GEEEEEEEEEEESSSSSSSSSSSSSSSSSSSS SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

    G: sign (1: negative)
    E: exponent (biased; subtract 1023) (11 bits)
    S: significand (52 bits)
*)

(*
    significand: Bits 19...0 of high byte and complete low byte

    The significand is initially divided by 1000 using integer division. The
    bits are divided into two parts:

    - d1 contains the 31 most significant bits (plus leading 1)
    - d0 contains the next less significant bits

    In total, we use 48 bits of the original significand.
*)

(* d1: insert significand bits from high byte *)
d1 := INH AND     2#0000_0000_0000_1111_1111_1111_1111_1111;
(* result:        2#0000_0000_0000_HHHH_HHHH_HHHH_HHHH_HHHH *)

(* add the 1 before the binary point *)
d1 := d1 OR       2#0000_0000_0001_0000_0000_0000_0000_0000;
(* result:        2#0000_0000_0001_HHHH_HHHH_HHHH_HHHH_HHHH *)

(* "flush left" shift 11 places *)
d1 := d1 * 2048;
(* result:        2#1HHH_HHHH_HHHH_HHHH_HHHH_H000_0000_0000 *)

(* Insert another 11 bits from low byte (msb ones) *)
d1 := d1 OR (INL / 2097152);
(* result:        2#1HHH_HHHH_HHHH_HHHH_HHHH_HLLL_LLLL_LLLL *)

(* Base-65536 division. Integer divide by 1000 and save remainder *)
y1 := d1 / 1000;
r1 := TO_DW(TO_UD(d1) MOD 1000);

(*
   The significand now has leading zeroes. Shift left to make space
   at the other end.
*)
FOR shift := 1 TO 31 BY 1 DO
    y1 := y1 * 2;
    IF (y1 AND 2#1000_0000_0000_0000_0000_0000_0000_0000) <> 0 THEN
        EXIT;
    END_IF;
END_FOR;

(*
   d0: insert next 16 bits from the low byte
   (right shift five times and zero out the leading places)
*)
(* bits:             2#xxxx_xxxx_xxxL_LLLL_LLLL_LLLL_LLLx_xxxx *)
d0 := (INL / 32) AND 2#0000_0000_0000_0000_1111_1111_1111_1111;
(* result:           2#0000_0000_0000_0000_LLLL_LLLL_LLLL_LLLL *)

(* Now divide by 1000, factoring in remainder from before *)
y0 := ((r1 * 65536) OR d0) / 1000;

(*
   y1 and y0 contain results from division by 1000. We'll now build a 32 bit
   significand from these.

   y1 = 2#1HHH_HHHH_HHHH_HHHH_HHHH_HHxx_xxxx_xxxx
   y0 = 2#0000_0000_0000_0000_LLLL_LLLL_LLLL_LLLL

   y1 has an uncertain number of zeroes at its end, resulting from the above
   left shifting (number of steps inside variable "shift"). Fill those with the
   most significant bits from y0.

   y0 has 16 valid bits (0..15). Shift right so that the "highest place zero"
   in y1 corresponds with the MSB from y0. (shift by 16-shift)

   y1 = 2#1HHH_HHHH_HHHH_HHHH_HHHH_HHxx_xxxx_xxxx (ex.: shift=10)
   y0 = 2#0000_0000_0000_0000_0000_00LL_LLLL_LLLL
                              ------>^
*)

FOR i := 1 TO 16 - shift BY 1 DO
    y0 := y0 / 2;
END_FOR;

significand := TO_UD(y1 OR y0);
(* Result: 32-bit significand *)

(*
    Exponent: bits (62-32)...(52-32) or bits 30...20 of high byte, respectively

    Coded with bias of 1023 (needs to be subtracted).

    Special cases as per standard:
    - 16#000: signed zero or underflow (map to zero)
    - 16#7FF: infinity or NaN (map to overflow)
*)
temp := 2#0111_1111_1111_0000_0000_0000_0000_0000 AND INH;
temp := temp / 1048576 ; (* right shift 20 places (2^20) *)
exponent := TO_IN(TO_DI(temp));
exponent := exponent - 1023; (* remove bias *)

(*
   Above, we already left shifted "shift" times, which needs to be taken into
   account here by shifting less.
*)
exponent := exponent - shift;

(*
    The significand will be output as UDINT, but was initially a binary64 with
    binary point behind the leading 1, after which the coded exponent must be
    "executed".

    temp = 2#1.HHH_HHHH_HHHH_HHHH_HHHH_HLLL_LLLL_LLLL

    As UDINT, this already corresponds to a 31-fold left shift.

    Exponent cases as per IEEE 754:

    - exponent < 0:            result < 1
    - exponent = 0:       1 <= result < 2
    - exponent = x > 0: 2^x <= result < 2^(x+1)

    The UDINT output (32 bit) allows us to represent exponents right up to 31.
    Everything above is mapped to UDINT's maximum value.

    Now determine, after the de facto 31-fold left shift, what shifts remain
    "to do".
*)

IF exponent < 0 THEN
    (* underflow: < 2^0 *)
    significand := 0;
ELSIF exponent > 31 THEN
    (* overflow: > 2^32 - 1 *)
    significand := 4294967295;
ELSE
    (*
        result is significand * 2^exponent or here, as mentioned above,
        significand / 2^(31-exponent).

        The loop index i is the "shift target" after loop execution, which is
        why it starts at 31-1.

        Example: exponent = 27, but de facto we've already got a shift of 31.
        So we'll shift back four times to place the binary point at the right
        position (30, 29, 28, 27):

        before: temp = 2#1HHH_HHHH_HHHH_HHHH_HHHH_HLLL_LLLL_LLLL.

        after:  temp = 2#1HHH_HHHH_HHHH_HHHH_HHHH_HLLL_LLLL.LLLL
                                                           ^<---|
    *)
    FOR i := 30 TO exponent BY -1 DO
        significand := significand / 2;
    END_FOR;
END_IF;

(*
    sign: bit 63 of high byte
*)
sign := (2#1000_0000_0000_0000_0000_0000_0000_0000 AND INH) <> 0;

OUT := significand;
SGN := sign;

END_FUNCTION_BLOCK

The test data I used:

  high byte     low byte  kWh (decimal)
=======================================
16#41c558c3, 16#2d3f331e,       716_277
16#41EFFFFF, 16#5E000000,     4_294_966
16#41EFFFFF, 16#DB000000,     4_294_967
16#41F00000, 16#2C000000,     4_294_968
16#426D1A94, 16#A1830000,   999_999_999
16#426D1A94, 16#A2000000, 1_000_000_000
16#426D1A94, 16#A27D0000, 1_000_000_001
16#428F3FFF, 16#FFC18000, 4_294_967_294
16#428F3FFF, 16#FFE0C000, 4_294_967_295
16#428F4000, 16#00000000, 4_294_967_296

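By the way, integer literals of the form b#1234 in SCL basically mean "the number 1234 in base b". Underscores are ignored (they are digit separators to improve readability, as Python has them, for example).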
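
For readers without an SCL environment, the steps of the function block can be sketched in C using only 32-bit unsigned arithmetic. This is an illustrative port that has not been run on the DCS: the name `pac3200_kwh` is made up, the sign bit is ignored (the function block reports it separately as SGN), and `UINT32_MAX` stands in for the saturation value 4294967295.

```c
#include <stdint.h>

/* inh/inl: high and low 32-bit words of the binary64 value in Wh.
   Returns whole kWh, saturating at UINT32_MAX. */
static uint32_t pac3200_kwh(uint32_t inh, uint32_t inl)
{
    /* d1: leading 1, 20 significand bits from inh, top 11 bits of inl. */
    uint32_t d1 = (((inh & 0x000FFFFFu) | 0x00100000u) << 11) | (inl >> 21);
    /* d0: the next 16 significand bits. */
    uint32_t d0 = (inl >> 5) & 0xFFFFu;

    /* Long division of the 48-bit value d1:d0 by 1000 in radix 65536. */
    uint32_t y1 = d1 / 1000u, r1 = d1 % 1000u;
    uint32_t y0 = ((r1 << 16) | d0) / 1000u;

    /* Normalize y1 so that its top bit is set. Because d1 >= 2^31,
       2^21 <= y1 < 2^23 holds and the loop runs 9 or 10 times. */
    int shift = 0;
    while ((y1 & 0x80000000u) == 0) {
        y1 <<= 1;
        ++shift;
    }

    /* Fill the vacated low bits with the most significant bits of y0. */
    uint32_t significand = y1 | (y0 >> (16 - shift));

    /* Unbiased exponent, reduced by the normalization shift. */
    int exponent = (int)((inh >> 20) & 0x7FFu) - 1023 - shift;

    if (exponent < 0)
        return 0;                     /* result is below 1 kWh */
    if (exponent > 31)
        return UINT32_MAX;            /* saturate on overflow  */
    return significand >> (31 - exponent);
}
```

With the test data above, `pac3200_kwh(0x426D1A94u, 0xA1830000u)` evaluates to 999999999, and the last row of the table saturates at `UINT32_MAX` because 4,294,967,296 does not fit into 32 bits.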

Answer 1

/*  This program shows two methods of dividing an integer exceeding 32 bits
    by 1000 using unsigned 32-bit integer arithmetic.
*/


#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>


/*  If the count is less than 2**35, we can shift three bits (divide by 8) and
    then divide by 125 using 32-bit unsigned arithmetic.
*/
static uint32_t ShiftThenDivide(uint64_t x)
{
    uint32_t y = x >> 3;
    return y / 125;
}


/*  Given any count less than 1000*2**32 (which exceeds the 2**40 requirement),
    we can perform long division in radix 65536.
*/
static uint64_t LongDivision(uint64_t x)
{
    /*  Set d1 to the high two base-65536 digits (bits 16 to 47) and d0 to
        the low digit (bits 0 to 15).
    */
    uint32_t d1 = x >> 16, d0 = x & 0xffffu;

    //  Get the quotient and remainder of dividing d1 by 1000.
    uint32_t y1 = d1 / 1000, r1 = d1 % 1000;

    /*  Combine the previous remainder with the low digit of the dividend and
        divide by 1000.
    */
    uint32_t y0 = (r1<<16 | d0) / 1000;

    //  Return a quotient formed from the two quotient digits.
    return y1 << 16 | y0;
}


static void Test(uint64_t x)
{
    //  Use 64-bit arithmetic to get a reference result.
    uint32_t y0 = x / 1000;

    //  ShiftThenDivide only works up to 2**35, so only test up to that.
    if (x < UINT64_C(1) << 35)
    {
        uint32_t y1 = ShiftThenDivide(x);
        if (y1 != y0)
        {
            printf("Error, 0x%" PRIx64 " / 1000 = 0x%" PRIx32 ", but ShiftThenDivide produces 0x%" PRIx32 ".\n",
                x, y0, y1);
            exit(EXIT_FAILURE);
        }
    }

    //  Test LongDivision.
    uint32_t y2 = LongDivision(x);
    if (y2 != y0)
    {
        printf("Error, 0x%" PRIx64 " / 1000 = 0x%" PRIx32 ", but LongDivision produces 0x%" PRIx32 ".\n",
            x, y0, y2);
        exit(EXIT_FAILURE);
    }
}


int main(void)
{
    srandom(time(0));

    //  Test all possible values for the upper eight bits.
    for (uint64_t upper = 0; upper < 1<<8; ++upper)
    {
        //  Test some edge cases.
        uint64_t x = upper << 32;
        Test(x);
        Test(x+1);
        Test(x-1 & 0xffffffffffu);
            /*  When x is zero, x-1 would wrap modulo 2**64, but that is
                outside our supported domain, so wrap modulo 2**40.
            */

        //  Test an assortment of low 32 bits.
        for (int i = 0; i < 1000; ++i)
        {
            uint32_t r0 = random() & 0xffffu, r1 = random() & 0xffffu;
            uint64_t lower = r1 << 16 | r0;
            Test(x | lower);
        }
    }
}

Answer 2

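I would approach this slightly differently. Since the OP does not mention which programming language is used, I will write down some pseudocode here. I will assume that the binary64 floating-point number is passed in as a sequence of 8 bytes, and that the OP takes care of endianness where needed.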

1. Split the binary64 into three binary32 floating-point numbers:

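A binary64 floating-point number is represented by one sign bit, 11 exponent bits and 52 bits representing the significand. Its value is computed as:

$$(-1)^{b_{63}} \left(1 + \sum_{i=1}^{52} b_{52-i}\, 2^{-i}\right) \times 2^{e-1023}$$

A binary32 floating-point number is represented by one sign bit, 8 exponent bits and 23 bits representing the significand. Its value is computed as:

$$(-1)^{b_{31}} \left(1 + \sum_{i=1}^{23} b_{23-i}\, 2^{-i}\right) \times 2^{e-127}$$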

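The idea is now to create three binary32 floating-point numbers f{1,2,3} such that, when using real arithmetic (no floating-point approximations), the binary64 floating-point number d is given by: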

d = f1 + f2 + f3

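Assume the function Extract(d,n,m) returns the integer formed by significand bits n to m of the binary64 bit representation d (counted from the most significant significand bit, i = 1):

function val = Extract(d,n,m)
   val = Sum(b_(52−i) 2^(m−i); i = n → m)

and the function Exponent(d) returns the value e−1023 of the binary64 bit representation d.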

Then we know that

f1 = (2^23 + Extract(d,1,23)) * 2^(Exponent(d) - 23)
f2 = Extract(d,24,46) * 2^(Exponent(d) - 46)
f3 = Extract(d,47,52) * 2^(Exponent(d) - 52)
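
The three-way split can be sketched in C as below. This is an illustration under stated assumptions rather than the answer's own code: the input is the 64-bit pattern of a positive, normal binary64 with Exponent(d) between -74 and 127, `float` is IEEE 754 binary32, and the names `Float3`, `pow2f` and `split_binary64` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef struct { float f1, f2, f3; } Float3;

/* 2^k as a binary32, built from its bit pattern; valid for -126 <= k <= 127. */
static float pow2f(int k)
{
    uint32_t bits = (uint32_t)(k + 127) << 23;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Split a binary64 bit pattern into three binary32 values whose exact
   (real-number) sum is the original value. */
static Float3 split_binary64(uint64_t bits)
{
    int e = (int)((bits >> 52) & 0x7FFu) - 1023;        /* Exponent(d)      */
    uint64_t frac = bits & UINT64_C(0xFFFFFFFFFFFFF);   /* 52 fraction bits */
    uint32_t top = (uint32_t)(frac >> 29);              /* Extract(d,1,23)  */
    uint32_t mid = (uint32_t)(frac >> 6) & 0x7FFFFFu;   /* Extract(d,24,46) */
    uint32_t low = (uint32_t)frac & 0x3Fu;              /* Extract(d,47,52) */

    Float3 r;
    /* Each integer has at most 24 bits, so the conversions to float and
       the scalings by powers of two are exact. */
    r.f1 = (float)(0x800000u + top) * pow2f(e - 23);
    r.f2 = (float)mid * pow2f(e - 46);
    r.f3 = (float)low * pow2f(e - 52);
    return r;
}
```

On a host that does have doubles, adding the three parts in binary64 reproduces the original value bit for bit, which is a convenient way to check the bit positions.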

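2. Divide the values by 1000:

Unfortunately, this is easier said than done. It is well known that computing in finite precision implies rounding errors, which lead to inaccurate results. This is exactly what we are trying to avoid here. If we just compute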

f1 * 1E-3 + f2 * 1E-3 + f3 * 1E-3

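we would introduce rounding errors.

Assume a and b are two floating-point numbers, fl(x) returns the floating-point value of the real number x, and a OP b denotes one of the basic operations +, - and * carried out exactly in real arithmetic. In general a OP b != fl(a OP b), because a real number cannot always be represented exactly as a floating-point number. It can be shown, however, that a OP b = fl(a OP b) + y where y is itself a floating-point number. This y is the error we would miss in the computation above when we just compute f1 * fl(1E-3).

Hence, to compute d * fl(1E-3) accurately, we need to keep track of the error terms. To do so, we will use some error-free transformations, which are reviewed in the paper Accurate summation, dot product and polynomial evaluation in complex floating-point arithmetic: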

# error-free transformation of the sum of two floating-point numbers
function [x,y] = TwoSum(a,b)
   x = a + b
   z = x - a
   y = ((a - (x - z)) + (b - z))
# error-free split of a floating-point number into two parts
# (factor 2^s + 1 with s = 12 for the 24-bit significand of binary32)
function [x,y] = Split(a)
   c = (2^12 + 1) * a
   x = (c - (c - a))
   y = a - x
# error-free transformation of the product of two floating-point numbers
function [x,y] = TwoProduct(a,b)
   x = a * b
   [a1,a2] = Split(a); [b1,b2] = Split(b)
   y = (a2*b2 - (((x - a1*b1) - a2*b1) - a1*b2))
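
A possible C rendering of these three transformations for binary32 is sketched below. It assumes IEEE 754 `float` with round-to-nearest and no overflow or underflow, and it uses the splitting constant 2^12 + 1 that is customary for a 24-bit significand. The type `FloatPair` and the lower-case function names are chosen for this illustration only.

```c
typedef struct { float x, y; } FloatPair;

/* volatile temporaries force each step to be rounded to binary32 and keep
   the compiler from fusing or simplifying the operations. */

/* x + y == a + b exactly, with x = fl(a + b). */
static FloatPair two_sum(float a, float b)
{
    volatile float x = a + b;
    volatile float z = x - a;
    volatile float u = x - z;
    volatile float v = a - u;          /* a - (x - z) */
    volatile float w = b - z;
    FloatPair r;
    r.x = x;
    r.y = v + w;
    return r;
}

/* x + y == a exactly, each part fitting into about 12 bits. */
static FloatPair split(float a)
{
    volatile float c = 4097.0f * a;    /* (2^12 + 1) * a */
    volatile float t = c - a;
    volatile float hi = c - t;
    FloatPair r;
    r.x = hi;
    r.y = a - hi;
    return r;
}

/* x + y == a * b exactly, with x = fl(a * b). */
static FloatPair two_product(float a, float b)
{
    volatile float x = a * b;
    FloatPair as = split(a), bs = split(b);
    volatile float p11 = as.x * bs.x, p21 = as.y * bs.x;
    volatile float p12 = as.x * bs.y, p22 = as.y * bs.y;
    volatile float t = x - p11;
    t = t - p21;
    t = t - p12;
    FloatPair r;
    r.x = x;
    r.y = p22 - t;
    return r;
}
```

For example, `two_sum(16777216.0f, 1.0f)` returns x = 16777216 and y = 1: the sum 16777217 is not representable in binary32, and the lost unit reappears in the error term.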

3. The complete function:

So if we want to rescale a binary64 number with bit representation d using binary32 floating-point arithmetic, we should use the function:

# rescale double-precision d by a single-precision a
function res = Rescale(d,a)
   # first term
   f = (2^23 + Extract(d,1,23)) * 2^(Exponent(d) - 23)
   [p,s] = TwoProduct(f,a)
   # second term
   f = Extract(d,24,46) * 2^(Exponent(d) - 46)
   [h,r] = TwoProduct(f,a)
   [p,q] = TwoSum(p,h)
   s = s + (q + r)       # the error term
   # third term
   f = Extract(d,47,52) * 2^(Exponent(d) - 52)
   [h,r] = TwoProduct(f,a)
   [p,q] = TwoSum(p,h)
   s = s + (q + r)       # the error term
   # the final result
   res = p + s

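This keeps track of all numerical errors made in the floating-point arithmetic and compensates the result accordingly. The value res returned by Rescale will therefore be the most accurate single-precision representation of d/1000 (with a = fl(1E-3)).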

Answer 3

1e12Wh / 1kWh = 1e9.

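A 4-byte, 32-bit INT (signed or unsigned) gives you a little more than 9 significant digits. But you must remember that it is in units of kWh, not Wh, and every time you add to it you may incur another rounding error.
FLOAT has only ~7 digits of resolution; you need 9. It is OK for the sensor to send a FLOAT, but accumulating in a FLOAT is not.
DOUBLE is 8 bytes, ~16 significant digits. That would let you store Wh.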

A compromise is to accumulate in a DOUBLE, but divide by 1000 and store the result in a 4-byte INT.
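
On a platform that does offer binary64 (unlike the DCS in the question), this compromise might look like the following C sketch. The helper name `kwh_from_wh` and the saturating behaviour are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical helper: convert a Wh total held in a double to whole kWh
   in 32 bits, saturating at the ends of the uint32_t range. */
static uint32_t kwh_from_wh(double wh)
{
    double kwh = wh / 1000.0;
    if (!(kwh >= 0.0))               /* negative values and NaN */
        return 0;
    if (kwh >= 4294967296.0)         /* 2^32 and above          */
        return UINT32_MAX;
    return (uint32_t)kwh;            /* truncates toward zero   */
}
```

Because the conversion truncates, 999,999,999,999 Wh is reported as 999,999,999 kWh rather than being rounded up.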

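Is there a problem with storing in a DOUBLE? Aside from taking extra space, it solves essentially all the problems: more than enough resolution and protection from rounding errors; the ability to store in the 'natural' unit of Wh; etc. If possible, I would use DOUBLE.

(An unsigned 32-bit integer would be the second choice.)

I would not even consider a 40-bit integer (or two 32-bit integers), because they are clumsy to use and may cause porting difficulties.

Thinking outside the box...

Store a subtotal for each year. Then, when you need the grand total, sum the subtotals. (For a database-oriented solution, I would do this anyway.)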

floating-point

precision

integer-overflow

ieee-754

floating-point-conversion