Why do the hash values differ for NaN and Inf - Inf?
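I use this hash function a lot, e.g. to record the value of a data frame. I wanted to see if I could break it. Why aren't these hash values identical?
This requires the digest package.
Plain text output: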
> digest(Inf-Inf)
[1] "0d59b2dae9351c1ce6c76133295322d7"
> digest(NaN)
[1] "4e9653ddf814f0d16b72624aeb85bc20"
> digest(1)
[1] "6717f2823d3202449301145073ab8719"
> digest(1 + 0)
[1] "6717f2823d3202449301145073ab8719"
> digest(5)
[1] "5e338704a8e069ebd8b38ca71991cf94"
> digest(sum(1, 1, 1, 1, 1))
[1] "5e338704a8e069ebd8b38ca71991cf94"
> digest(1^0)
[1] "6717f2823d3202449301145073ab8719"
> 1^0
[1] 1
> digest(1)
[1] "6717f2823d3202449301145073ab8719"
Additional weirdness: calculations that evaluate to NaN have identical hash values, but the hash of the literal NaN is different:
> Inf - Inf
[1] NaN
> 0/0
[1] NaN
> digest(Inf - Inf)
[1] "0d59b2dae9351c1ce6c76133295322d7"
> digest(0/0)
[1] "0d59b2dae9351c1ce6c76133295322d7"
> digest(NaN)
[1] "4e9653ddf814f0d16b72624aeb85bc20"
This has to do with digest::digest using base::serialize, which gives non-identical results for the 2 mentioned objects with ascii = FALSE, which is the default passed to it by digest:
identical(
base::serialize(Inf-Inf, connection = NULL, ascii = FALSE),
base::serialize(NaN, connection = NULL, ascii = FALSE)
)
# [1] FALSE
whereas
identical(Inf-Inf, NaN)
# [1] TRUE
tl;dr this has to do with very deep details of how NaNs are represented in binary. You can work around it by using digest(., ascii=TRUE) ...
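To make the workaround concrete, here is a minimal base-R sketch that compares binary and ASCII serialization for two NaNs that differ only in the sign bit. The helper nan_from_bytes() is hypothetical (it is not part of digest or base R); it builds the two bit patterns explicitly because whether Inf - Inf and the literal NaN differ at all depends on the platform.

```r
# Hypothetical helper: build a NaN from explicit big-endian bytes so the
# sign bit is under our control on any platform.
nan_from_bytes <- function(first_byte) {
  readBin(as.raw(c(first_byte, 0xf8, 0, 0, 0, 0, 0, 0)),
          what = "double", n = 1L, size = 8L, endian = "big")
}

nan_neg <- nan_from_bytes(0xff)  # sign bit 1, like Inf - Inf in the output above
nan_pos <- nan_from_bytes(0x7f)  # sign bit 0, like the literal NaN above

bin_neg <- serialize(nan_neg, connection = NULL, ascii = FALSE)
bin_pos <- serialize(nan_pos, connection = NULL, ascii = FALSE)
asc_neg <- serialize(nan_neg, connection = NULL, ascii = TRUE)
asc_pos <- serialize(nan_pos, connection = NULL, ascii = TRUE)

binary_same <- identical(bin_neg, bin_pos)  # binary form keeps the raw bit pattern
ascii_same  <- identical(asc_neg, asc_pos)  # text form does not carry the sign bit
n_diff      <- sum(bin_neg != bin_pos)      # how many bytes differ in binary form

# With the digest package installed, the corresponding call would be
# digest::digest(x, ascii = TRUE).
```

Printing binary_same, ascii_same and n_diff shows whether the two serializations agree and how many bytes separate the binary ones.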
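Following up on @Jozef's answer: note the one byte that differs between the two outputs (ff vs. 7f, shown in bold in the original) ...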
> base::serialize(Inf-Inf,connection=NULL)
[1] 58 0a 00 00 00 03 00 03 06 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
[26] 00 0e 00 00 00 01 ff f8 00 00 00 00 00 00
> base::serialize(NaN,connection=NULL)
[1] 58 0a 00 00 00 03 00 03 06 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
[26] 00 0e 00 00 00 01 7f f8 00 00 00 00 00 00
Alternatively, using pryr::bytes() ...
> bytes(NaN)
[1] "7F F8 00 00 00 00 00 00"
> bytes(Inf-Inf)
[1] "FF F8 00 00 00 00 00 00"
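If pryr is not installed, a similar byte dump can be sketched in base R with writeBin(). The function name double_bytes() is made up for this example, and the output format is only meant to resemble that of pryr::bytes().

```r
# Hypothetical base-R stand-in for pryr::bytes(), for a single double:
# write the value as 8 big-endian bytes and print them as upper-case hex.
double_bytes <- function(x) {
  stopifnot(is.double(x), length(x) == 1L)
  raw_be <- writeBin(x, raw(), endian = "big")
  toupper(paste(sprintf("%02x", as.integer(raw_be)), collapse = " "))
}

double_bytes(1)          # an ordinary finite value
double_bytes(NaN)        # the literal NaN
double_bytes(Inf - Inf)  # a computed NaN; its sign bit is platform-dependent
```

On the machine that produced the output above, the last two calls would be expected to differ in the first byte only; on other platforms they may well be identical.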
The Wikipedia article on floating point format/NaNs says:
Some operations of floating-point arithmetic are invalid, such as taking the square root of a negative number. The act of reaching an invalid result is called a floating-point exception. An exceptional result is represented by a special code called a NaN, for "Not a Number". All NaNs in IEEE 754-1985 have this format:
- sign = either 0 or 1.
- biased exponent = all 1 bits.
- fraction = anything except all 0 bits (since all 0 bits represents infinity).
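The sign is the first bit; the exponent is the next 11 bits; the fraction is the last 52 bits. Translating the first four hex digits given above to binary, Inf-Inf (FF F8) is 1111 1111 1111 1000 (sign=1; exponent is all ones, as required; fraction starts with 1000) whereas NaN (7F F8) is 0111 1111 1111 1000 (the same, but with sign=0).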
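The field layout described above can be checked with a short base-R sketch. The name double_fields() and the shape of its return value are assumptions made for this example; it splits the 8 big-endian bytes of one double into the sign bit, the 11-bit biased exponent, and the 52-bit fraction (shown as 13 hex digits).

```r
# Hypothetical helper: decode sign / exponent / fraction of one double.
double_fields <- function(x) {
  stopifnot(is.double(x), length(x) == 1L)
  b <- as.integer(writeBin(x, raw(), endian = "big"))  # 8 bytes, most significant first
  list(
    sign     = b[1] %/% 128L,                          # top bit of byte 1
    exponent = (b[1] %% 128L) * 16L + b[2] %/% 16L,    # next 11 bits
    fraction = paste0(sprintf("%x", b[2] %% 16L),      # low 4 bits of byte 2 ...
                      paste(sprintf("%02x", b[3:8]), collapse = ""))  # ... plus bytes 3-8
  )
}

one_fields  <- double_fields(1)     # 3F F0 ...: sign 0, exponent 1023, fraction 0
ninf_fields <- double_fields(-Inf)  # FF F0 ...: sign 1, exponent all ones, fraction 0
nan_fields  <- double_fields(NaN)   # exponent all ones, fraction non-zero
```

Calling double_fields(Inf - Inf) and double_fields(NaN) and comparing only the sign element shows whether a given platform reproduces the difference discussed here.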
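To understand why Inf-Inf ends up with sign bit 1 and NaN has sign bit 0, you would probably have to dig more deeply into the way floating point arithmetic is implemented on this platform ...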
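It might be worth raising an issue on the digest GitHub repo about this; I can't think of an elegant way to do it, but it seems reasonable that objects for which identical(x,y) is TRUE in R should have identical hashes ... Note that identical() specifically ignores these differences in bit patterns via the single.NA argument (default TRUE):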
single.NA: logical indicating if there is conceptually just one numeric
‘NA’ and one ‘NaN’; ‘single.NA = FALSE’ differentiates bit
patterns.
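A small sketch of the single.NA behaviour, again using two NaNs built from explicit bytes so that the result does not depend on how the platform computes Inf - Inf (the variable names are made up for this example):

```r
# Two NaNs that differ only in the sign bit, built from big-endian bytes.
nan_neg <- readBin(as.raw(c(0xff, 0xf8, 0, 0, 0, 0, 0, 0)),
                   what = "double", n = 1L, size = 8L, endian = "big")
nan_pos <- readBin(as.raw(c(0x7f, 0xf8, 0, 0, 0, 0, 0, 0)),
                   what = "double", n = 1L, size = 8L, endian = "big")

# Default: conceptually a single NaN, so the bit patterns are not compared.
same_default <- identical(nan_neg, nan_pos)

# single.NA = FALSE: NA/NaN values are compared by bit pattern.
same_bitwise <- identical(nan_neg, nan_pos, single.NA = FALSE)
```

Here same_default and same_bitwise should disagree, which is the same split that shows up between identical() and the binary hash.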
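In the C code, it looks like R simply uses C's != operator to compare NaN values unless bitwise comparison is enabled, in which case it explicitly checks the bytes in memory for equality: see here. That is, the default comparison appears to treat different kinds of NaN values as equivalent ... (Strictly speaking, under IEEE 754 a != comparison involving any NaN is true, so the equivalence presumably comes from an explicit NaN check made before that comparison rather than from != itself.)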
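One possible (not necessarily elegant) way to get equal hashes for objects that identical() considers equal is to overwrite every NaN with the same value before hashing. The sketch below is an assumption-laden illustration, not something the digest package does: canonicalize_nan() is a hypothetical helper, it only handles plain double vectors, and it relies on the assignment copying the bit pattern of R's NaN constant.

```r
# Hypothetical pre-hash normalisation: give every NaN the same bit pattern.
# is.nan() is FALSE for NA_real_, so ordinary NAs are left untouched.
canonicalize_nan <- function(x) {
  stopifnot(is.double(x))
  x[is.nan(x)] <- NaN
  x
}

# Two vectors whose NaNs differ only in the sign bit (built from raw bytes).
nan_neg <- readBin(as.raw(c(0xff, 0xf8, 0, 0, 0, 0, 0, 0)),
                   what = "double", n = 1L, size = 8L, endian = "big")
nan_pos <- readBin(as.raw(c(0x7f, 0xf8, 0, 0, 0, 0, 0, 0)),
                   what = "double", n = 1L, size = 8L, endian = "big")
a <- c(1, nan_neg, NA_real_)
b <- c(1, nan_pos, NA_real_)

raw_before <- identical(serialize(a, NULL), serialize(b, NULL))
raw_after  <- identical(serialize(canonicalize_nan(a), NULL),
                        serialize(canonicalize_nan(b), NULL))

# With digest installed, one would hash canonicalize_nan(x) instead of x.
```

For a data frame, the same idea would have to be applied column by column to the double columns; that part is left out here.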