将 double 转换为 float 时如何确定舍入方向?
How to determine the rounding direction when casting a double to a float?
我正在寻找一种算法来确定将任意 64 位双精度数转换为 32 位浮点数时的舍入方向。我对此检查的具体用例是将 64 位双精度转换为 32 位浮点数并向无穷大舍入。
首先,我应用以下标准来检查尾数截断部分中的位。如果尾数的截断部分不为零,则强制转换必须向下舍入!
const F64_MASK: u64 = (1 << 29) - 1;
fn is_rounded_up_when_casted(v: f64) -> bool {
v.to_bits() & F64_MASK > 0
}
但是,此标准并未识别所有情况 -- 指数的最后三位也被截断。我也尝试修改掩码以检查这些指数位:
const F64_MASK: u64 = (1u64 << 55) - (1 << 52) + (1 << 29) - 1;
很遗憾,此检查无效。例如,数字 1.401298464324817e−45
有一个指数,其中三个截断位是 010
,但仍然精确表示在 float/f32.
中
编辑:我不认为可以说非零尾数意味着正舍入。我想我需要一种不同的方法。我认为指数只是增加了数字的范围,因此可以通过一些单独的检查来处理。舍入方向可能只是尾数截断部分前导位的函数?
无需修改位:
#[derive(PartialEq, std::fmt::Debug)]
enum Direction { Equal, Up, Down }
fn get_rounding_direction(v: f64) -> Direction {
match v.partial_cmp(&(v as f32 as f64)) {
Some(Ordering::Greater) => Direction::Down,
Some(Ordering::Less) => Direction::Up,
_ => Direction::Equal
}
}
以及一些检查正确性的测试。
#[cfg(test)]
#[test]
fn test_get_rounding_direction() {
// check that the f64 one step below 2 casts to exactly 2
assert_eq!(get_rounding_direction(1.9999999999999998), Direction::Up);
// check edge cases
assert_eq!(get_rounding_direction(f64::NAN), Direction::Equal);
assert_eq!(get_rounding_direction(f64::NEG_INFINITY), Direction::Equal);
assert_eq!(get_rounding_direction(f64::MIN), Direction::Down);
assert_eq!(get_rounding_direction(-f64::MIN_POSITIVE), Direction::Up);
assert_eq!(get_rounding_direction(-0.), Direction::Equal);
assert_eq!(get_rounding_direction(0.), Direction::Equal);
assert_eq!(get_rounding_direction(f64::MIN_POSITIVE), Direction::Down);
assert_eq!(get_rounding_direction(f64::MAX), Direction::Up);
assert_eq!(get_rounding_direction(f64::INFINITY), Direction::Equal);
// for all other f32
for u32_bits in 1..f32::INFINITY.to_bits() - 1 {
let f64_value = f32::from_bits(u32_bits) as f64;
let u64_bits = f64_value.to_bits();
if u32_bits % 100_000_000 == 0 {
println!("checkpoint every 600 million tests: {}", f64_value);
}
// check that the f64 equivalent to the current f32 casts to a value that is equivalent
assert_eq!(get_rounding_direction(f64_value), Direction::Equal, "at {}, {}", u32_bits, f64_value);
// check that the f64 one step below the f64 equivalent to the current f32 casts to a value that is one step greater
assert_eq!(get_rounding_direction(f64::from_bits(u64_bits - 1)), Direction::Up, "at {}, {}", u32_bits, f64_value);
// check that the f64 one step above the f64 equivalent to the current f32 casts to a value that is one step less
assert_eq!(get_rounding_direction(f64::from_bits(u64_bits + 1)), Direction::Down, "at {}, {}", u32_bits, f64_value);
// same checks for negative numbers
let u64_bits = (-f64_value).to_bits();
assert_eq!(get_rounding_direction(f64_value), Direction::Equal, "at {}, {}", u32_bits, f64_value);
assert_eq!(get_rounding_direction(f64::from_bits(u64_bits - 1)), Direction::Down, "at {}, {}", u32_bits, f64_value);
assert_eq!(get_rounding_direction(f64::from_bits(u64_bits + 1)), Direction::Up, "at {}, {}", u32_bits, f64_value);
}
}
要具体地向无穷大舍入:
fn cast_toward_inf(vf64: f64) -> f32 {
let vf32 = vf64 as f32;
if vf64 > vf32 as f64 { f32::from_bits(vf32.to_bits() + 1) } else { vf32 }
}
可以主要从第 28 位(尾数截断部分的第一位)确定舍入,但处理边缘情况会带来很大的复杂性。
您发现的边缘情况与以下事实有关:次正规的 f32 值实际上可以表示小于其典型最小值的指数。我写了一个我认为涵盖所有边缘情况的函数:
const F64_MANTISSA_SIZE: u64 = 52;
const F64_MANTISSA_MASK: u64 = (1 << F64_MANTISSA_SIZE) - 1;
const F64_EXPONENT_SIZE: u64 = 64 - F64_MANTISSA_SIZE - 1;
const F64_EXPONENT_MASK: u64 = (1 << F64_EXPONENT_SIZE) - 1; // shift away the mantissa first
const F32_MANTISSA_SIZE: u64 = 23;
const F64_TRUNCATED_MANTISSA_SIZE: u64 = F64_MANTISSA_SIZE - F32_MANTISSA_SIZE;
const F64_TRUNCATED_MANTISSA_MASK: u64 = (1 << F64_TRUNCATED_MANTISSA_SIZE) - 1;
fn is_exactly_representable_as_f32(v: f64) -> bool {
let bits = v.to_bits();
let mantissa = bits & F64_MANTISSA_MASK;
let exponent = (bits >> F64_MANTISSA_SIZE) & F64_EXPONENT_MASK;
let _sign = bits >> (F64_MANTISSA_SIZE + F64_EXPONENT_SIZE) != 0;
if exponent == 0 {
// if mantissa == 0, the float is 0 or -0, which is representable
// if mantissa != 0, it's a subnormal, which is never representable
return mantissa == 0;
}
if exponent == F64_EXPONENT_MASK {
// either infinity or nan, all of which are representable
return true;
}
// remember to subtract the bias
let computed_exponent = exponent as i64 - 1023;
// -126 and 127 are the min and max value for a standard f32 exponent
if (-126..=127).contains(&computed_exponent) {
// at this point, it's only exactly representable if the truncated mantissa is all zero
return mantissa & F64_TRUNCATED_MANTISSA_MASK == 0;
}
// exponents less than 2**(-126) may be representable by f32 subnormals
if computed_exponent < -126 {
// this is the number of leading zeroes that need to be in the f32 mantissa
let diff = -127 - computed_exponent;
// this is the number of bits in the mantissa that must be preserved (essentially mantissa with trailing zeroes trimmed off)
let mantissa_bits = F64_MANTISSA_SIZE - (mantissa.trailing_zeros() as u64).min(F64_MANTISSA_SIZE) + 1;
// the leading zeroes + essential mantissa bits must be able to fit in the smaller mantissa size
return diff as u64 + mantissa_bits <= F32_MANTISSA_SIZE;
}
// the exponent is >127 so f32s can't go that high
return false;
}
我正在寻找一种算法来确定将任意 64 位双精度数转换为 32 位浮点数时的舍入方向。我对此检查的具体用例是将 64 位双精度转换为 32 位浮点数并向无穷大舍入。
首先,我应用以下标准来检查尾数截断部分中的位。如果尾数的截断部分不为零,则强制转换必须向下舍入!
const F64_MASK: u64 = (1 << 29) - 1;
fn is_rounded_up_when_casted(v: f64) -> bool {
v.to_bits() & F64_MASK > 0
}
但是,此标准并未识别所有情况 -- 指数的最后三位也被截断。我也尝试修改掩码以检查这些指数位:
const F64_MASK: u64 = (1u64 << 55) - (1 << 52) + (1 << 29) - 1;
很遗憾,此检查无效。例如,数字 1.401298464324817e−45
有一个指数,其中三个截断位是 010
,但仍然精确表示在 float/f32.
编辑:我不认为可以说非零尾数意味着正舍入。我想我需要一种不同的方法。我认为指数只是增加了数字的范围,因此可以通过一些单独的检查来处理。舍入方向可能只是尾数截断部分前导位的函数?
无需修改位:
#[derive(PartialEq, std::fmt::Debug)]
enum Direction { Equal, Up, Down }
fn get_rounding_direction(v: f64) -> Direction {
match v.partial_cmp(&(v as f32 as f64)) {
Some(Ordering::Greater) => Direction::Down,
Some(Ordering::Less) => Direction::Up,
_ => Direction::Equal
}
}
以及一些检查正确性的测试。
#[cfg(test)]
#[test]
fn test_get_rounding_direction() {
// check that the f64 one step below 2 casts to exactly 2
assert_eq!(get_rounding_direction(1.9999999999999998), Direction::Up);
// check edge cases
assert_eq!(get_rounding_direction(f64::NAN), Direction::Equal);
assert_eq!(get_rounding_direction(f64::NEG_INFINITY), Direction::Equal);
assert_eq!(get_rounding_direction(f64::MIN), Direction::Down);
assert_eq!(get_rounding_direction(-f64::MIN_POSITIVE), Direction::Up);
assert_eq!(get_rounding_direction(-0.), Direction::Equal);
assert_eq!(get_rounding_direction(0.), Direction::Equal);
assert_eq!(get_rounding_direction(f64::MIN_POSITIVE), Direction::Down);
assert_eq!(get_rounding_direction(f64::MAX), Direction::Up);
assert_eq!(get_rounding_direction(f64::INFINITY), Direction::Equal);
// for all other f32
for u32_bits in 1..f32::INFINITY.to_bits() - 1 {
let f64_value = f32::from_bits(u32_bits) as f64;
let u64_bits = f64_value.to_bits();
if u32_bits % 100_000_000 == 0 {
println!("checkpoint every 600 million tests: {}", f64_value);
}
// check that the f64 equivalent to the current f32 casts to a value that is equivalent
assert_eq!(get_rounding_direction(f64_value), Direction::Equal, "at {}, {}", u32_bits, f64_value);
// check that the f64 one step below the f64 equivalent to the current f32 casts to a value that is one step greater
assert_eq!(get_rounding_direction(f64::from_bits(u64_bits - 1)), Direction::Up, "at {}, {}", u32_bits, f64_value);
// check that the f64 one step above the f64 equivalent to the current f32 casts to a value that is one step less
assert_eq!(get_rounding_direction(f64::from_bits(u64_bits + 1)), Direction::Down, "at {}, {}", u32_bits, f64_value);
// same checks for negative numbers
let u64_bits = (-f64_value).to_bits();
assert_eq!(get_rounding_direction(f64_value), Direction::Equal, "at {}, {}", u32_bits, f64_value);
assert_eq!(get_rounding_direction(f64::from_bits(u64_bits - 1)), Direction::Down, "at {}, {}", u32_bits, f64_value);
assert_eq!(get_rounding_direction(f64::from_bits(u64_bits + 1)), Direction::Up, "at {}, {}", u32_bits, f64_value);
}
}
要具体地向无穷大舍入:
fn cast_toward_inf(vf64: f64) -> f32 {
let vf32 = vf64 as f32;
if vf64 > vf32 as f64 { f32::from_bits(vf32.to_bits() + 1) } else { vf32 }
}
可以主要从第 28 位(尾数截断部分的第一位)确定舍入,但处理边缘情况会带来很大的复杂性。
您发现的边缘情况与以下事实有关:次正规的 f32 值实际上可以表示小于其典型最小值的指数。我写了一个我认为涵盖所有边缘情况的函数:
const F64_MANTISSA_SIZE: u64 = 52;
const F64_MANTISSA_MASK: u64 = (1 << F64_MANTISSA_SIZE) - 1;
const F64_EXPONENT_SIZE: u64 = 64 - F64_MANTISSA_SIZE - 1;
const F64_EXPONENT_MASK: u64 = (1 << F64_EXPONENT_SIZE) - 1; // shift away the mantissa first
const F32_MANTISSA_SIZE: u64 = 23;
const F64_TRUNCATED_MANTISSA_SIZE: u64 = F64_MANTISSA_SIZE - F32_MANTISSA_SIZE;
const F64_TRUNCATED_MANTISSA_MASK: u64 = (1 << F64_TRUNCATED_MANTISSA_SIZE) - 1;
fn is_exactly_representable_as_f32(v: f64) -> bool {
let bits = v.to_bits();
let mantissa = bits & F64_MANTISSA_MASK;
let exponent = (bits >> F64_MANTISSA_SIZE) & F64_EXPONENT_MASK;
let _sign = bits >> (F64_MANTISSA_SIZE + F64_EXPONENT_SIZE) != 0;
if exponent == 0 {
// if mantissa == 0, the float is 0 or -0, which is representable
// if mantissa != 0, it's a subnormal, which is never representable
return mantissa == 0;
}
if exponent == F64_EXPONENT_MASK {
// either infinity or nan, all of which are representable
return true;
}
// remember to subtract the bias
let computed_exponent = exponent as i64 - 1023;
// -126 and 127 are the min and max value for a standard f32 exponent
if (-126..=127).contains(&computed_exponent) {
// at this point, it's only exactly representable if the truncated mantissa is all zero
return mantissa & F64_TRUNCATED_MANTISSA_MASK == 0;
}
// exponents less than 2**(-126) may be representable by f32 subnormals
if computed_exponent < -126 {
// this is the number of leading zeroes that need to be in the f32 mantissa
let diff = -127 - computed_exponent;
// this is the number of bits in the mantissa that must be preserved (essentially mantissa with trailing zeroes trimmed off)
let mantissa_bits = F64_MANTISSA_SIZE - (mantissa.trailing_zeros() as u64).min(F64_MANTISSA_SIZE) + 1;
// the leading zeroes + essential mantissa bits must be able to fit in the smaller mantissa size
return diff as u64 + mantissa_bits <= F32_MANTISSA_SIZE;
}
// the exponent is >127 so f32s can't go that high
return false;
}