How do I truncate the significand of a floating point number to an arbitrary precision in Java?
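I would like to introduce some artificial precision loss into two numbers being compared, to smooth out minor rounding errors, so that I don't have to use the Math.abs(x - y) < eps idiom in every comparison involving x and y.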
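Essentially, I want something that behaves like down-casting a double to a float and then up-casting it back to a double, except that I also want to preserve very large and very small exponents, and I want some control over the number of significand bits that are kept.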
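To make the limitation of the float round trip concrete, here is a minimal sketch. The class and method names are hypothetical and only serve the illustration: the cast keeps roughly float precision for ordinary values but loses values whose exponents fall outside the float range.

```java
final class FloatRoundTripDemo {
    // Hypothetical helper: round-trips a double through float.
    static double viaFloat(double x) {
        return (double) (float) x;
    }

    public static void main(String[] args) {
        // An ordinary value keeps only float precision.
        System.out.println(viaFloat(73.0967787376657));
        // The exponent is too large for a float, so the value overflows.
        System.out.println(viaFloat(1e300));
        // The exponent is too small for a float, so the value underflows.
        System.out.println(viaFloat(1e-300));
    }
}
```

The last two calls are the cases a reducePrecision function would need to handle without losing the magnitude.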
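Given the following function that produces the binary representation of the significand of a 64-bit IEEE 754 number:
public static String significand(double d) {
    int SIGN_WIDTH = 1;
    int EXP_WIDTH = 11;
    // Left-pad the raw bit pattern with zeros to the full 64 characters.
    String s = String.format("%64s", Long.toBinaryString(Double.doubleToRawLongBits(d))).replace(' ', '0');
    // Skip the sign and exponent fields; the remaining 52 characters are the stored significand bits.
    return s.substring(SIGN_WIDTH + EXP_WIDTH);
}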
I want a function reducePrecision(double x, int bits) that reduces the precision of the significand of a double such that:
significand(reducePrecision(x, bits)).substring(bits).equals(String.format("%0" + (52 - bits) + "d", 0))
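In other words, every bit after the bits-most significant bit in the significand of reducePrecision(x, bits) should be 0, while the bits-most significant bits in the significand of reducePrecision(x, bits) should reasonably approximate the bits-most significant bits in the significand of x.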
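The zero-bits condition above can also be checked with a bit mask instead of string manipulation. The following is only a sketch; the class, constant, and method names are hypothetical, and it assumes 0 <= bits <= 52.

```java
final class TrailingBitsCheck {
    // Number of explicitly stored significand bits in a 64-bit IEEE 754 double.
    static final int STORED_SIGNIFICAND_BITS = 52;

    // Hypothetical helper: returns true when every stored significand bit after
    // the `bits` most significant ones is zero. Assumes 0 <= bits <= 52.
    static boolean lowBitsAreZero(double value, int bits) {
        long mask = (1L << (STORED_SIGNIFICAND_BITS - bits)) - 1;
        return (Double.doubleToRawLongBits(value) & mask) == 0;
    }

    public static void main(String[] args) {
        double full = 73.0967787376657;
        double narrowed = (double) (float) full;
        // A value using its full precision has non-zero low bits.
        System.out.println(lowBitsAreZero(full, 23));
        // After a float round trip only 23 stored significand bits remain.
        System.out.println(lowBitsAreZero(narrowed, 23));
    }
}
```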
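Suppose x is the number whose precision you wish to reduce and bits is the number of significant bits you wish to retain.
When bits is sufficiently large and the order of magnitude of x is sufficiently close to 0, x * (1L << (bits - Math.getExponent(x))) scales x so that the bits to be removed end up in the fractional component (after the radix point), while the bits to be retained end up in the integer component (before the radix point). You can then round this to remove the fractional component, and divide the rounded number by (1L << (bits - Math.getExponent(x))) to restore x's order of magnitude, i.e.: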
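public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    // Cast to double first: Math.round returns a long, and long / long would be integer division.
    return (double) Math.round(x * (1L << exponent)) / (1L << exponent);
}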
However, (1L << exponent) breaks down when Math.getExponent(x) > bits || Math.getExponent(x) < bits - 62. A solution is to use Math.pow(2, exponent) (or a fast pow2(exponent) implementation) to calculate a fractional or very large power of 2, i.e.:
public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.round(x * Math.pow(2, exponent)) * Math.pow(2, -exponent);
}
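The limitation of the shift-based version can be seen in a small sketch. The class and method names here are hypothetical, and the shift variant includes a cast to double so that its division is carried out in floating point. Java uses only the low 6 bits of the shift distance for a long, so an out-of-range exponent silently produces the wrong scale factor rather than an error.

```java
final class ShiftVersusPow {
    // Shift-based variant; only valid while 0 <= exponent <= 62.
    static double reduceWithShift(double x, int bits) {
        int exponent = bits - Math.getExponent(x);
        return (double) Math.round(x * (1L << exponent)) / (1L << exponent);
    }

    // Math.pow-based variant; tolerates negative and larger exponents.
    static double reduceWithPow(double x, int bits) {
        int exponent = bits - Math.getExponent(x);
        return Math.round(x * Math.pow(2, exponent)) * Math.pow(2, -exponent);
    }

    public static void main(String[] args) {
        // The shift distance wraps around modulo 64.
        System.out.println((1L << 64) == 1L);
        System.out.println((1L << -1) == Long.MIN_VALUE);

        // Math.getExponent(1e30) is far larger than bits = 23,
        // so the shift-based variant scales by the wrong factor.
        double big = 1e30;
        System.out.println(reduceWithShift(big, 23));
        System.out.println(reduceWithPow(big, 23));
    }
}
```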
However, Math.pow(2, exponent) breaks down as exponent approaches -1074 or +1023. A solution is to use Math.scalb(x, exponent) so that the power of 2 doesn't have to be calculated explicitly, i.e.:
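public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    // Cast the long result to double so that the double overload of Math.scalb is chosen
    // (a bare long argument would select scalb(float, int)).
    return Math.scalb((double) Math.round(Math.scalb(x, exponent)), -exponent);
}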
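A rough sketch of the difference between the two variants for a subnormal input follows. The class and method names are hypothetical, and the scalb variant casts the rounded long to double explicitly. For a subnormal x, Math.getExponent returns -1023, so with bits = 23 the required scale factor is 2^1046, which is not representable as a double.

```java
final class PowVersusScalb {
    // Math.pow-based variant: the scale factor itself must be representable.
    static double reduceWithPow(double x, int bits) {
        int exponent = bits - Math.getExponent(x);
        return Math.round(x * Math.pow(2, exponent)) * Math.pow(2, -exponent);
    }

    // Math.scalb-based variant: the power of 2 is never materialized.
    static double reduceWithScalb(double x, int bits) {
        int exponent = bits - Math.getExponent(x);
        return Math.scalb((double) Math.round(Math.scalb(x, exponent)), -exponent);
    }

    public static void main(String[] args) {
        // 2^1046 overflows the double range.
        System.out.println(Math.pow(2, 1046));

        // A subnormal value whose only set bit lies within the first 23 significand bits,
        // so reducing it to 23 bits should leave it unchanged.
        double tiny = 0x1p-1045;
        System.out.println(reduceWithPow(tiny, 23) == tiny);
        System.out.println(reduceWithScalb(tiny, 23) == tiny);
    }
}
```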
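However, Math.round(y) returns a long, so it does not preserve Infinity, NaN, or cases where Math.abs(x) > Long.MAX_VALUE / Math.pow(2, exponent). Furthermore, Math.round(y) always rounds ties towards positive infinity (e.g. Math.round(0.5) == 1 && Math.round(1.5) == 2). A solution is to use Math.rint(y), which takes and returns a double and follows the unbiased IEEE 754 round-to-nearest, ties-to-even rule (e.g. Math.rint(0.5) == 0.0 && Math.rint(1.5) == 2.0), i.e.: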
public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.scalb(Math.rint(Math.scalb(x, exponent)), -exponent);
}
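The difference between the two rounding functions, and the effect on special values, can be illustrated with a short sketch. The class name is hypothetical; reducePrecision is copied from the final version above.

```java
final class RoundingModesDemo {
    // Final version from the text, repeated so the example is self-contained.
    static double reducePrecision(double x, int bits) {
        int exponent = bits - Math.getExponent(x);
        return Math.scalb(Math.rint(Math.scalb(x, exponent)), -exponent);
    }

    public static void main(String[] args) {
        // Ties: Math.round goes towards positive infinity, Math.rint goes to the even neighbour.
        double[] ties = {0.5, 1.5, 2.5};
        for (double y : ties) {
            System.out.println(y + ": round=" + Math.round(y) + ", rint=" + Math.rint(y));
        }

        // Math.round cannot represent NaN or Infinity in a long; Math.rint passes them through.
        System.out.println(Math.round(Double.NaN) + " vs " + Math.rint(Double.NaN));
        System.out.println(Math.round(Double.POSITIVE_INFINITY) + " vs " + Math.rint(Double.POSITIVE_INFINITY));

        // The rint-based reducePrecision therefore preserves the special values.
        System.out.println(reducePrecision(Double.NaN, 23));
        System.out.println(reducePrecision(Double.NEGATIVE_INFINITY, 23));
    }
}
```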
Finally, here is a unit test confirming our expectations:
public static String decompose(double d) {
    int SIGN_WIDTH = 1;
    int EXP_WIDTH = 11;
    int SIGNIFICAND_WIDTH = 53;
    String s = String.format("%64s", Long.toBinaryString(Double.doubleToRawLongBits(d))).replace(' ', '0');
    return s.substring(0, 0 + SIGN_WIDTH) + " "
            + s.substring(0 + SIGN_WIDTH, 0 + SIGN_WIDTH + EXP_WIDTH) + " "
            + s.substring(0 + SIGN_WIDTH + EXP_WIDTH, 0 + SIGN_WIDTH + EXP_WIDTH + SIGNIFICAND_WIDTH - 1);
}
// Assert is assumed to be a JUnit-style assertion class providing assertTrue and assertFalse.
public static void test() {
    // Use a fixed seed so the generated numbers are reproducible.
    java.util.Random r = new java.util.Random(0);
    // Generate a floating point number that makes use of its full 52 bits of significand precision.
    double a = r.nextDouble() * 100;
    System.out.println(decompose(a) + " " + a);
    Assert.assertFalse(decompose(a).split(" ")[2].substring(23).equals(String.format("%0" + (52 - 23) + "d", 0)));
    // Cast the double to a float to produce a "ground truth" of precision loss to compare against.
    double b = (float) a;
    System.out.println(decompose(b) + " " + b);
    Assert.assertTrue(decompose(b).split(" ")[2].substring(23).equals(String.format("%0" + (52 - 23) + "d", 0)));
    // 32-bit float has a 23 bit significand, so c's bit pattern should be identical to b's bit pattern.
    double c = reducePrecision(a, 23);
    System.out.println(decompose(c) + " " + c);
    Assert.assertTrue(b == c);
    // 23rd-most significant bit in c is 1, so rounding it to the 22nd-most significant bit requires breaking a tie.
    // Since 22nd-most significant bit in c is 0, d will be rounded down so that its 22nd-most significant bit remains 0.
    double d = reducePrecision(c, 22);
    System.out.println(decompose(d) + " " + d);
    Assert.assertTrue(decompose(d).split(" ")[2].substring(22).equals(String.format("%0" + (52 - 22) + "d", 0)));
    Assert.assertTrue(decompose(c).split(" ")[2].charAt(22) == '1' && decompose(c).split(" ")[2].charAt(21) == '0');
    Assert.assertTrue(decompose(d).split(" ")[2].charAt(21) == '0');
    // 21st-most significant bit in d is 1, so rounding it to the 20th-most significant bit requires breaking a tie.
    // Since 20th-most significant bit in d is 1, e will be rounded up so that its 20th-most significant bit becomes 0.
    // Rounding d (rather than c) is what makes this an exact tie.
    double e = reducePrecision(d, 20);
    System.out.println(decompose(e) + " " + e);
    Assert.assertTrue(decompose(e).split(" ")[2].substring(20).equals(String.format("%0" + (52 - 20) + "d", 0)));
    Assert.assertTrue(decompose(d).split(" ")[2].charAt(20) == '1' && decompose(d).split(" ")[2].charAt(19) == '1');
    Assert.assertTrue(decompose(e).split(" ")[2].charAt(19) == '0');
    // Reduce the precision of a number close to the largest normal number.
    double f = reducePrecision(a * 0x1p+1017, 23);
    System.out.println(decompose(f) + " " + f);
    // Reduce the precision of a number close to the smallest normal number.
    double g = reducePrecision(a * 0x1p-1028, 23);
    System.out.println(decompose(g) + " " + g);
    // Reduce the precision of a number in the subnormal range.
    double h = reducePrecision(a * 0x1p-1051, 23);
    System.out.println(decompose(h) + " " + h);
}
And its output:
0 10000000101 0010010001100011000110011111011100100100111000111011 73.0967787376657
0 10000000101 0010010001100011000110100000000000000000000000000000 73.0967788696289
0 10000000101 0010010001100011000110100000000000000000000000000000 73.0967788696289
0 10000000101 0010010001100011000110000000000000000000000000000000 73.09677124023438
0 10000000101 0010010001100011001000000000000000000000000000000000 73.0968017578125
0 11111111110 0010010001100011000110100000000000000000000000000000 1.0266060746443803E308
0 00000000001 0010010001100011000110100000000000000000000000000000 2.541339559435826E-308
0 00000000000 0000000000000000000000100000000000000000000000000000 2.652494739E-315
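To tie this back to the comparison use case from the question, here is a minimal sketch of how the function might be used. The class name and the equalAtPrecision helper are hypothetical, and the choice of 40 bits is arbitrary. Note that two values lying on opposite sides of a rounding boundary can still reduce to different results, so this is not a full replacement for a tolerance check.

```java
final class ReducedPrecisionCompare {
    // Final version from the text, repeated so the example is self-contained.
    static double reducePrecision(double x, int bits) {
        int exponent = bits - Math.getExponent(x);
        return Math.scalb(Math.rint(Math.scalb(x, exponent)), -exponent);
    }

    // Hypothetical helper: compares two values after keeping only `bits` significand bits.
    static boolean equalAtPrecision(double x, double y, int bits) {
        return reducePrecision(x, bits) == reducePrecision(y, bits);
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2;
        // Exact comparison is defeated by a rounding error in the last bits.
        System.out.println(sum == 0.3);
        // Comparing the reduced values ignores the low-order significand bits.
        System.out.println(equalAtPrecision(sum, 0.3, 40));
    }
}
```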