Does Python have a bug in the implementation of np.std for large arrays?
I am trying to calculate the variance via np.std(array, ddof=0). The problem arises if I happen to have a long, delta-like array, i.e. all values in the array are identical. Instead of returning std = 0, it gives some small value, which in turn causes further estimation errors. The mean is returned correctly...
Example:
np.std([0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1],ddof = 0)
gives 1.80411241502e-16
but
np.std([0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1],ddof = 0)
gives std = 0
Is there any way to overcome this, other than checking the data for uniqueness at every iteration and not computing the std at all?
Thanks
P.S. After the question was marked as a duplicate of Is floating point math broken?, here is a copy-paste of @kxr's reply on why this is a different question:
"The current duplicate marking is wrong. Its not just about simple float comparison, but about internal aggregation of small errors for near-zero outcome by using the np.std on long arrays - as the questioner indicated extra. Compare e.g. >>> np.std([0.1, 0.1, 0.1, 0.1, 0.1, 0.1]*200000) -> 2.0808632594793153e-12
. So he can e.g. solve by: >>> mean = a.mean(); xmean = round(mean, int(-log10(mean)+9)); std = np.sqrt(((a - xmean) ** 2).sum()/ a.size)
"
The problem of course starts with the floating-point representation, but it does not stop there.
@kxr - I appreciate the comment and the example.
Welcome to the world of practical numerical algorithms! In real life, if you have two floating-point numbers x and y, checking x == y is meaningless. Consequently, the question of whether the standard deviation is 0 makes no sense either; it is either close to zero or it is not. Let's check it with np.isclose:
>>> import numpy as np
>>> np.isclose(1.80411241502e-16, 0)
True
That is the best you can hope for. And in real life you cannot even check whether all the items are the same, as you suggest. Are they floats? Were they produced by some other process? They will have small errors too.
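As a hedged illustration of that point (the tolerances below are just NumPy's defaults, not values from the answer), both "is the std zero?" and "are all items the same?" can be asked with a tolerance instead of exact equality:

import numpy as np

a = np.array([0.1] * 90)
s = np.std(a, ddof=0)
print(s)                     # ~1.8e-16, tiny but not exactly zero

# float-friendly replacement for "s == 0"
print(np.isclose(s, 0.0))    # True (within the default atol of 1e-08)

# float-friendly replacement for "all items are identical"
print(np.allclose(a, a[0]))  # True within the default tolerances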