How to keep precision on int64_t = int64_t * float?

Question

I would like to correct an int64_t by a factor in the range [0.01..1.2], with a precision of about 0.01. The naive implementation would be:

int64_t apply_correction(int64_t y, float32_t factor)
{
    return y * factor;
}

Unfortunately, if I cast factor to int32 or cast y to float, I lose precision.

However, if I can ensure that the maximum value of y is lower than 1<<56, I can use this trick:

(1<<8) * (y / (int32_t)(factor * (1<<8)))

How can I solve this problem if my input value can be larger than 1<<56?

Plot twist:

I run on a 32-bit architecture where int64_t is an emulated type, and I have no double-precision support. The architecture is a SHARC from Analog Devices.

Answer 1

How about doing it in integer space?

/* factor precision is two decimal places */
int64_t apply_correction(int64_t y, float32_t factor)
{
    return y * (int32_t)(factor * 100) / 100;
}

This does assume that y is not very close to the maximum value, but it gets you closer to it than 56 bits.
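One detail worth watching in the cast above: converting a float to an integer truncates, so a factor whose scaled value lands just below a whole number can come out one hundredth too small. A minimal sketch of a round-to-nearest conversion, assuming the factor is non-negative (the helper name factor_to_hundredths is made up for illustration):

```c
#include <stdint.h>

/* Hypothetical helper: convert a correction factor to whole hundredths,
 * rounding to nearest instead of truncating. Assumes factor >= 0. */
static int32_t factor_to_hundredths(float factor)
{
    /* Adding 0.5 before the truncating cast rounds to the nearest integer. */
    return (int32_t)(factor * 100.0f + 0.5f);
}
```

The result can then be used in place of (int32_t)(factor * 100) in the function above.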

Answer 2

Just don't use floating point.

int64_t apply_correction(int64_t y, float32_t factor)
{
  int64_t factor_i64 = factor * 100.0f;  /* factor in hundredths, truncated */

  return (y * factor_i64) / 100ll;
}

This assumes that y * factor_i64 (that is, y * factor * 100) does not overflow.
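If that overflow is a concern, one possible workaround (not part of the answer above) is to divide first and handle the remainder separately, so the full product is never formed. A sketch, assuming the factor has already been converted to non-negative hundredths and that the final result fits in int64_t; scale_hundredths is a hypothetical name:

```c
#include <stdint.h>

/* Computes y * hundredths / 100 (truncating toward zero) without forming
 * the full product y * hundredths.
 * Assumes hundredths >= 0 and that the final result fits in int64_t. */
static int64_t scale_hundredths(int64_t y, int32_t hundredths)
{
    int64_t q = y / 100;  /* whole hundreds in y */
    int64_t r = y % 100;  /* remainder, same sign as y since C99 */

    /* y = 100*q + r, so y*h/100 = q*h + (r*h)/100; |r*h| stays small. */
    return q * hundredths + (r * hundredths) / 100;
}
```

For example, scale_hundredths(1000000000000, 120) gives 1200000000000. The price is an extra 64-bit division and modulo, which are likely to be slow on a target where int64_t is emulated.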

Answer 3

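If you calculate ((int64_t)1 << 57) * 100 or * 256, you get a signed integer overflow, which gives your code undefined behaviour. If you used uint64_t with that value instead, your code would be well-defined but would still produce the wrong result.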
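To make the unsigned case concrete: ((uint64_t)1 << 57) * 256 is mathematically 2^65, which wraps to 0 in 64-bit unsigned arithmetic. A small sketch of a guard that detects this before multiplying (the helper name mul_u64_overflows is hypothetical):

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if y * m would not fit in uint64_t.
 * Uses a division so the check itself cannot wrap. */
static int mul_u64_overflows(uint64_t y, uint64_t m)
{
    return m != 0 && y > UINT64_MAX / m;
}
```

With this, mul_u64_overflows((uint64_t)1 << 57, 256) reports an overflow, while the same call with (uint64_t)1 << 55 does not.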

However, it is possible to make this work almost up to (1 << 63) / 1.2.

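If y is a uint64_t, you can split the original number into its most significant 32 bits (shifted right by 32) and its least significant 32 bits, and multiply each half by (int32_t)(factor * (1 << 8)).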

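After the multiplication, instead of shifting the most significant part right by 8 bits, you shift it left by 24 bits; then you add the two parts together: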

uint64_t apply_uint64_correction(uint64_t y, float32_t factor)
{
    uint64_t most_significant = (y >> 32) * (uint32_t)(factor * (1 << 8));
    uint64_t least_significant = (y & 0xFFFFFFFFULL) * (uint32_t)(factor * (1 << 8));     
    return (most_significant << 24) + (least_significant >> 8);
}
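The reason the two shifts are 24 and 8 can be written out as follows. This is a sketch of the arithmetic, using symbols that are not in the answer: h and l for the upper and lower 32 bits of y, and f for the integer (uint32_t)(factor * (1 << 8)).

```latex
\[
y = h \cdot 2^{32} + l, \qquad 0 \le h,\, l < 2^{32}
\]
\[
\left\lfloor \frac{y f}{2^{8}} \right\rfloor
  = \left\lfloor \frac{(h \cdot 2^{32} + l)\, f}{2^{8}} \right\rfloor
  = h f \cdot 2^{24} + \left\lfloor \frac{l f}{2^{8}} \right\rfloor
\]
```

The term h f 2^24 is an integer, so it can be pulled out of the floor unchanged. Each partial product h f and l f is a 32-bit value times a factor of at most 307, so it fits comfortably in 64 bits; only the final left shift of h f by 24 can overflow, and only when the corrected result itself no longer fits.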

Now apply_uint64_correction(1000000000000, 1.2) results in 1199218750000, and apply_uint64_correction(1000000000000, 1.25) results in 1250000000000.
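The 1.2 result is off because the factor is truncated to 8 fractional bits before it is used, whereas 1.25 happens to be exactly representable. A small sketch that exposes the integer multiplier actually used (quantize_factor is a hypothetical helper, and frac_bits is assumed to be small enough that the result fits in 32 bits):

```c
#include <stdint.h>

/* Hypothetical helper: the integer multiplier obtained for a given number
 * of fractional bits. Assumes factor >= 0 and factor * 2^frac_bits < 2^32. */
static uint32_t quantize_factor(float factor, unsigned frac_bits)
{
    return (uint32_t)(factor * (float)((uint32_t)1 << frac_bits));
}
```

quantize_factor(1.2f, 8) is 307, so the effective factor is 307/256 = 1.19921875, and 1000000000000 * 307 / 256 is exactly 1199218750000. quantize_factor(1.25f, 8) is 320, and 320/256 is exactly 1.25.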

In fact, if you can guarantee the range of factor, you can increase the precision:

uint64_t apply_uint64_correction(uint64_t y, float32_t factor)
{
    uint64_t most_significant = (y >> 32) * (uint32_t)(factor * (1 << 24));
    uint64_t least_significant = (y & 0xFFFFFFFFULL) * (uint32_t)(factor * (1 << 24));     
    return (most_significant << 8) + (least_significant >> 24);
}

apply_uint64_correction(1000000000000, 1.2) gives 1200000047683 on my computer; if float32_t has a 24-bit mantissa, this is also the maximum precision you can get out of it.
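The final shift right by 24 truncates, so the result is always rounded down. If round-to-nearest is preferred, one option is to add half of the divisor to the low part before shifting. A sketch of that variant, not taken from the answer; the name scale_u64_q24_rounded is made up, and plain float stands in for float32_t:

```c
#include <stdint.h>

/* Hypothetical round-to-nearest variant of the 24-bit version.
 * Assumes 0 <= factor < 256, so that factor * 2^24 fits in uint32_t,
 * and that the corrected result fits in uint64_t. */
static uint64_t scale_u64_q24_rounded(uint64_t y, float factor)
{
    uint32_t f = (uint32_t)(factor * (float)(1u << 24));
    uint64_t hi = (y >> 32) * f;            /* upper half times multiplier */
    uint64_t lo = (y & 0xFFFFFFFFULL) * f;  /* lower half times multiplier */

    /* Adding 2^23 (half of 2^24) before the shift rounds half up. */
    return (hi << 8) + ((lo + (1u << 23)) >> 24);
}
```

For the same input, scale_u64_q24_rounded(1000000000000, 1.2f) gives 1200000047684 instead of 1200000047683.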

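The algorithm above also works for positive signed numbers, but since shifting negative signed values is a grey area, I would record the sign, convert the value to uint64_t, do the calculation portably, and then negate the result if the original sign was negative.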

int64_t apply_correction(int64_t y, float32_t factor) {
    int negative_result = 0;
    uint64_t positive_y = y;
    if (y < 0) {
        negative_result = 1;
        positive_y = -(uint64_t)y;  /* negate as unsigned: defined even for INT64_MIN */
    }

    uint64_t result = apply_uint64_correction(positive_y, factor);
    return negative_result ? -(int64_t)result : result;
}
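One corner of the sign handling is worth spelling out: INT64_MIN has no positive counterpart in int64_t, so negating it as a signed value overflows. Converting to uint64_t first and negating there avoids that, because unsigned arithmetic is modular. A minimal sketch (magnitude_i64 is a hypothetical helper name):

```c
#include <stdint.h>

/* Hypothetical helper: absolute value of y as an unsigned 64-bit number.
 * The subtraction happens in uint64_t, so it is well-defined even for
 * INT64_MIN, where -y on the signed value would overflow. */
static uint64_t magnitude_i64(int64_t y)
{
    return (y < 0) ? (uint64_t)0 - (uint64_t)y : (uint64_t)y;
}
```

Note that the corrected magnitude still has to fit in int64_t before it is negated back, so inputs near the extremes combined with a factor above 1.0 remain out of range.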
