Why does chisq.test sort data in descending order before summation

Question

Why does the chisq.test function in R sort the data in descending order before summing it?

The code in question is:

STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))

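If I were worried about numerical stability because of floating-point arithmetic and wanted an easy-to-deploy hack, I would sort the data in increasing order before summing. That avoids adding a small value to a large value in the accumulator, and so avoids, as far as possible, trimming the least significant bits of the result.

I looked at the source code of sum, but it does not explain why the data would be passed to sum() in descending order. What am I missing?

An example: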

x = matrix(1.1, 10001, 1)
x[1] = 10^16   # We have a vector with 10000*1.1 and 1*10^16
c(sum(sort(x, decreasing = TRUE)), sum(sort(x, decreasing = FALSE)))

Result:

10000000000010996 10000000000011000

When we sort the data in ascending order, we get the correct result. When we sort it in descending order, the result is off by 4.

Answer 1

Edit: The book Accuracy and Stability of Numerical Algorithms by Nicholas J. Higham states:

"when summing nonnegative numbers by recursive summation the increasing ordering is the best ordering, in the sense of having the smallest a priori forward error bound."

Thanks to @Lamia for sharing this book in the comments.

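The book explains three summation methods: recursive, insertion, and pairwise. Each technique has its own advantages and disadvantages, depending on the size of the error bound associated with it, which can be derived through a systematic error analysis of floating-point summation.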
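As a rough illustration of the pairwise technique named above, the sketch below splits the vector in half recursively and adds the two partial sums. pairwise_sum is a hypothetical helper written for this illustration; it is not taken from the book or from R's sources.

```r
# Pairwise summation: sum each half separately, then add the two halves.
# pairwise_sum is a hypothetical helper, not part of base R.
pairwise_sum <- function(v) {
  n <- length(v)
  if (n == 0L) return(0)
  if (n == 1L) return(v[[1L]])
  mid <- n %/% 2L
  pairwise_sum(v[seq_len(mid)]) + pairwise_sum(v[(mid + 1L):n])
}

# Usage on the vector from the question
x <- c(1e16, rep(1.1, 10000))
format(pairwise_sum(x), digits = 22)
```

Because the partial sums are combined in a tree rather than one after another, fewer additions involve one very large and one very small operand, which is the usual motivation for this method.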

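Notably, the result of recursive summation depends on the ordering strategy, such as increasing, decreasing, and Psum (see page 82, paragraph 4 of the book, and the example at the bottom of page 82 for how it works).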

Looking at the R source code of the sum() function, available in summary.c, shows that R implements the recursive method in its sum() function.
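To make the order dependence of recursive summation concrete, here is a minimal sketch that accumulates left to right in a plain double. recursive_sum is a hypothetical helper, not R's actual implementation. On an IEEE 754 platform the descending order should drift to 10000000000020000, because every 1.1 added to a total near 10^16 is rounded up to 2. The built-in sum() is only off by 4 in the question, possibly because it accumulates in a wider type on many platforms; that is an assumption here, not something stated above.

```r
# Recursive (left-to-right) summation with a plain double accumulator.
# recursive_sum is a hypothetical helper, not part of base R.
recursive_sum <- function(v) {
  acc <- 0
  for (value in v) {
    acc <- acc + value  # each partial sum is rounded to a double
  }
  acc
}

x <- c(1e16, rep(1.1, 10000))
res <- c(asc  = recursive_sum(sort(x, decreasing = FALSE)),
         desc = recursive_sum(sort(x, decreasing = TRUE)))
format(res, digits = 22)
```

In increasing order the small terms are combined first, so only the final addition of 10^16 is rounded.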

Also, the significand of a double has 53 bits, which can be obtained from:

.Machine$double.digits
# [1] 53
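A short sketch of what 53 significand bits mean in practice, assuming IEEE 754 doubles: above 2^53 the gap between neighbouring doubles is 2, so an odd increment cannot be represented. The variable names are chosen for this illustration only.

```r
bits  <- .Machine$double.digits   # 53 on IEEE 754 doubles
limit <- 2^bits                   # 9007199254740992

c(limit + 1 == limit,             # TRUE: limit + 1 is rounded back to limit
  limit + 2 == limit)             # FALSE: limit + 2 is representable
```

Since 10^16 is larger than 2^53, the same spacing of 2 applies to the running total in the example from the question.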

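By setting this number as the precision in bits, we can compare the accuracy of summation under different ordering strategies in base R and with mpfr() from the Rmpfr library. Note that the increasing order yields results closer to those of the floating-point-aware summation, which confirms the statement from the book quoted above.

I computed the chi-square statistic using the raw data x.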

library('data.table')
library('Rmpfr')
x1 = matrix(c( 10^16, rep(1.1, 10000)), 
            nrow = 10001, ncol = 1)
df1 <- data.frame(x = x1)
setDT(df1)
df1[, p := rep(1/length(x), length(x))]
s_x <- df1[, sum(x)]
df1[, E := s_x * p]
df1[, chi := ((x - E)^2/E)]

precBits <- .Machine$double.digits
x_chi <- data.frame( names = c("x_asc", "x_desc", "x_fp_asc", "x_fp_desc",
                               "chi_asc", "chi_desc", "chi_fp_asc", "chi_fp_desc"))
x_chi$vals <- c( ## x
  df1[order(x), format( sum(x), digits = 22)],
  df1[order(-x), format( sum(x), digits = 22)],
  df1[order(x), format( sum(mpfr(x, precBits = precBits)), digits = 22)],
  df1[order(-x), format( sum(mpfr(x, precBits = precBits)), digits = 22)],
  ## chi
  df1[order(chi), format( sum(chi), digits = 22)],
  df1[order(-chi), format( sum(chi), digits = 22)],
  df1[order(chi), format( sum(mpfr(chi, precBits = precBits)), digits = 22)],
  df1[order(-chi), format( sum(mpfr(chi, precBits = precBits)), digits = 22)])

x_chi
#         names                    vals
# 1       x_asc       10000000000011000
# 2      x_desc       10000000000010996
# 3    x_fp_asc 10000000000011000.00000
# 4   x_fp_desc 10000000000020000.00000
# 5     chi_asc    99999999999890014218
# 6    chi_desc    99999999999890030592
# 7  chi_fp_asc 99999999999890014208.00
# 8 chi_fp_desc 99999999999833554944.00

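Looking at the source of the function via edit(chisq.test), I found no sorting operation in it.

Moreover, as pointed out in the comments, this has nothing to do with the sign (positive or negative) of the raw data values passed to chisq.test(). The function does not accept negative values, so it stops with the error message "all entries of 'x' must be nonnegative and finite".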

set.seed(2L)
chisq.test(c(rnorm(10, 0, 1)))
# Error in chisq.test(c(rnorm(10, 0, 1))) : 
#   all entries of 'x' must be nonnegative and finite

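The difference seen when summing floating-point numbers comes from double-precision arithmetic. See the demonstration below. When the precision is kept at 200 bits using the mpfr() function from the Rmpfr package, the summation gives the same result regardless of the order of the vector (x1 or x2). When that precision is not maintained, unequal values are observed.

Without FP precision maintained: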

x1 = matrix(c( 10^16, rep(1.1, 10000)), 
            nrow = 10001, ncol = 1)
## reverse
x2 = matrix(c( rep(1.1, 10000), 10^16 ), 
            nrow = 10001, ncol = 1)

c( format(sum(x1), digits = 22), 
   format(sum(x2), digits = 22))
# [1] "10000000000010996" "10000000000011000"

With FP precision maintained:

library('Rmpfr')
##
prec <- 200
x1 = matrix(c( mpfr( 10^16, precBits = prec),
              rep( mpfr(1.1, precBits = prec), 10000)), 
           nrow = 10001, ncol = 1)

## reverse
x2 = matrix(c( rep(mpfr(1.1, precBits = prec), 10000), 
              mpfr( 10^16, precBits = prec) ), 
           nrow = 10001, ncol = 1)
c( sum(x1), sum(x2))
# 2 'mpfr' numbers of precision  200   bits 
# [1] 10000000000011000.000000000000888178419700125232338905334472656
# [2] 10000000000011000.000000000000888178419700125232338905334472656

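The machine epsilon of doubles in base R, that is, the smallest positive x for which 1 + x != 1, can be obtained with the code below. It is not the smallest positive double. Rather, an addend smaller than roughly this fraction of the running total is rounded away, which is why summation gives different results depending on the order.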

.Machine$double.eps
# [1] 2.220446e-16
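A small check of what this value means, assuming IEEE 754 doubles with round-to-nearest: adding eps to 1 changes the result, while adding half of it does not.

```r
eps <- .Machine$double.eps

c(eps == 2^(-52),      # TRUE on IEEE 754 doubles
  1 + eps > 1,         # TRUE: 1 + eps is the next double after 1
  1 + eps / 2 == 1)    # TRUE: the sum is rounded back to 1
```

The effect is relative to the magnitude of the total: scaled up to a total near 10^16, the same relative gap becomes an absolute gap of 2.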

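Comparison of the chisq.test() computation with and without extended-precision arithmetic.

I extracted the relevant part of chisq.test() and used it to create a new function, chisq.test2(). Inside it, you will see the chi-square statistic computed both before and after applying 250-bit precision with the mpfr() function. The extended-precision results agree up to the last few digits for both orderings, whereas the results on the raw doubles differ.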

# modified chi square function:
chisq.test2 <- function (x, precBits) 
{
  if (is.matrix(x)) {
    if (min(dim(x)) == 1L) 
      x <- as.vector(x)
  }

  #before fp precision
  p = rep(1/length(x), length(x))
  n <- sum(x)
  E <- n * p

  # after fp precision
  x1 <- mpfr(x, precBits = precBits)
  p1 = rep(1/length(x1), length(x1))
  n1 <- sum(x1)
  E1 <- n1 * p1

  # chisquare statistic
  STATISTIC <- c(format(sum((x - E)^2/E), digits=22),           # doubles, input order (largest first for x1)
                 format(sum(sort((x - E)^2/E, decreasing = FALSE)), digits=22), # doubles, increasing order
                 sum((x1 - E1)^2/E1),                           # mpfr, input order (largest first for x1)
                 sum(sort((x1 - E1)^2/E1, decreasing = FALSE))) # mpfr, increasing order

  return(STATISTIC)
}

# data
x1 = matrix(c( 10^16, rep(1.1, 10000)), 
            nrow = 10001, ncol = 1)

chisq.test2(x = x1, precBits=250)

Output:

# [[1]]  # before fp decreasing
# [1] "99999999999890030592"
# 
# [[2]]  # before fp increasing
# [1] "99999999999890014218"
# 
# [[3]]  # after fp decreasing 
# 'mpfr1' 99999999999889972569.502489584522352514811399898444554440067408531548230046685
# 
# [[4]]  # after fp increasing
# 'mpfr1' 99999999999889972569.502489584522352514811399898444554440067408531548230251906

