AssertionError: negative sum of square deviations

Question

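As part of a larger project, I am writing a function that takes a dictionary of dictionaries of integers and returns a dictionary in which each "outer" key maps to the mean and standard deviation of the corresponding sub-dictionary (i.e. (mean(dict[key1]), stdev(dict[key1]))). I am working with a large dataset (the source file is a 2.8 GB csv file), and I get an assertion error when computing the standard deviation of one of the sub-dictionaries.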

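While I will track down (and am currently tracking down) the sub-dictionary that causes the error below, I am curious what causes it in general, so that I can try to avoid it if it occurs elsewhere in my dataset.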

The error message I get is:

AssertionError: negative sum of square deviations: -3734262324235.697754

from the code:

import statistics as stat

try: #Check for single value error
    std = stat.stdev(val)
except stat.StatisticsError:
    std = 0

Answer 1

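The code in statistics.py is pure Python. You appear to be the victim of an odd overflow error in the Fraction class while the internal statistics._ss function handles the "sum of squares".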

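I think the best you can do for now is to instrument the _ss function in statistics.py itself: add an "if" that calls pdb.set_trace so you can interactively find the data that triggers the error (a comment in the code notes that this part is subject to rounding error). The function computes a fraction that should be zero, apart from rounding error, and squares that fraction. When it is squared, however, the already large denominator is itself squared, which may trigger a bug inside Python's fractions and return a very large value where it should be close to zero.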

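Such an "if" clause lets you (1) bypass the error condition and run your code to completion, forcing the value to zero when the error is detected; and (2) note the values that cause the error and report them as a bug against the Python language itself.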
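A less invasive way to find the offending data, without editing statistics.py, is to catch the AssertionError in the calling code and record which sub-dictionary raised it. The following is only a sketch: the names summarize and nested are hypothetical, each sub-dictionary is assumed to be non-empty, and the AssertionError branch only fires on versions of the statistics module that contain this assertion (and not when Python runs with -O, which strips assert statements).

```python
import statistics


def summarize(nested):
    """Return ({key: (mean, stdev)}, [keys whose stdev raised AssertionError]).

    `nested` is a hypothetical dict of non-empty dicts of numbers.
    """
    results = {}
    failed = []
    for key, sub in nested.items():
        values = list(sub.values())
        try:
            std = statistics.stdev(values)
        except statistics.StatisticsError:
            std = 0  # fewer than two values, as in the question's code
        except AssertionError:
            # Remember the key so the raw values can be inspected later.
            failed.append(key)
            continue
        results[key] = (statistics.mean(values), std)
    return results, failed


results, failed = summarize({"a": {"x": 1, "y": 2, "z": 3}, "b": {"x": 5}})
```

The keys collected in failed point at the sub-dictionaries worth inspecting or attaching to a bug report.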

Answer 2

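As @jsbueno mentioned, this is a problem in the statistics.py file. I hit the same error and worked around it by replacing statistics.stdev with numpy.std rather than changing the source code.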
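One caveat with this swap: statistics.stdev computes the sample standard deviation (divisor n - 1), whereas numpy.std defaults to the population standard deviation (ddof=0). To get comparable numbers, pass ddof=1. A small illustration with made-up data, assuming NumPy is installed:

```python
import statistics

import numpy as np

data = [2.5, 3.25, 5.5, 11.25, 11.75]  # illustrative values only

sample_sd = np.std(data, ddof=1)   # divisor n - 1, comparable to statistics.stdev
population_sd = np.std(data)       # default ddof=0, comparable to statistics.pstdev

reference_sd = statistics.stdev(data)
```

numpy.std works in floating point throughout, so it does not go through the exact-fraction bookkeeping that triggers the assertion.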

Answer 3

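I ran into the same problem with very small numbers. The exact computation of sum(x²) gave zero (Fraction(0, 1)), but the exact computation of sum(x) gave a very small positive fraction, which represents the rounding error and loss of precision in subtracting the mean from the data.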

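The code in statistics.py states that total2 should be zero, but in practice it can be any small number, positive or negative. The square of total2 is always a small positive fraction: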

def _ss(data, c=None):
    """Return sum of square deviations of sequence data.

    If ``c`` is None, the mean is calculated in one pass, and the deviations
    from the mean are calculated in a second pass. Otherwise, deviations are
    calculated from ``c`` as given. Use the second case with care, as it can
    lead to garbage results.
    """
    if c is None:
        c = mean(data)
    T, total, count = _sum((x-c)**2 for x in data)
    # The following sum should mathematically equal zero, but due to rounding
    # error may not.
    U, total2, count2 = _sum((x-c) for x in data)
    assert T == U and count == count2
    total -=  total2**2/len(data)
    assert not total < 0, 'negative sum of square deviations: %f' % total
    return (T, total)

As a result, the sum of square deviations (total) can become negative just before the assertion, which then fails.
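For reference, the correction on the "total -= ..." line rests on the identity below, which holds in exact arithmetic for any c. Here n is len(data) and \bar{x} is the true mean (notation introduced for this note, not taken from the source file); the first sum corresponds to total and the bracketed sum to total2.

```latex
\[
\sum_{i=1}^{n} (x_i - c)^2 \;-\; \frac{1}{n}\Bigl(\sum_{i=1}^{n} (x_i - c)\Bigr)^{2}
\;=\; \sum_{i=1}^{n} (x_i - \bar{x})^2 \;\ge\; 0
\]
```

The left-hand side is guaranteed to be non-negative only when every square is exact. In _ss each (x-c)**2 is evaluated in the type of the data before _sum adds the results, so with floats the first term can come out smaller than its exact value (in the extreme case the squares underflow to zero) and the difference can dip below zero.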

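The underlying cause is the loss of accuracy that occurs when each value is squared in the first call to _sum. float or np.float64 values are squared inside the generator expression using floating-point arithmetic.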

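One possible fix is to convert total2 to type T before squaring it. That changes the semantics, because _ss would then return a value of type T rather than an exact fraction. Another, more accurate approach is to convert x-c to fractions once and for all before the first call to _sum. In both cases the computation would also run faster.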

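The most appropriate fix is not trivial, because _sum also aggregates the types through successive calls to _coerce. Converting the data to fractions beforehand would also change the result type to a fraction.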
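The idea of moving to exact fractions before any squaring happens can also be tried outside the library. The sketch below is not a patch to statistics.py and not the module's own algorithm: it converts the inputs themselves (rather than x-c) to Fraction, computes the mean exactly, and only converts to float at the very end. The function names are hypothetical, and the inputs are assumed to be finite ints or floats (Fraction rejects NaN and infinity).

```python
import math
from fractions import Fraction


def exact_sum_sq_dev(data):
    """Exact sum of squared deviations from the exact mean, as a Fraction."""
    values = [Fraction(x) for x in data]  # exact for ints and finite floats
    mean = sum(values, Fraction(0)) / len(values)
    # Every term is the square of an exact Fraction, so the sum cannot be negative.
    return sum(((v - mean) ** 2 for v in values), Fraction(0))


def exact_sample_stdev(data):
    """Sample standard deviation (divisor n - 1) computed from exact fractions."""
    data = list(data)
    if len(data) < 2:
        raise ValueError("at least two data points are required")
    return math.sqrt(exact_sum_sq_dev(data) / (len(data) - 1))
```

Because rounding only happens in the final square root, the intermediate sum of squares stays non-negative by construction. The cost is speed and memory on a dataset of this size, since Fraction arithmetic is much slower than float arithmetic.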

python

statistics

dictionary

standard-deviation

python-3.x