R: How to generate a vector of normally distributed probabilities to use with chisq.test

I have a vector of 30 samples and I want to test the hypothesis that they come from a normally distributed population.

> N.concentration
  [1] 0.164 0.045 0.069 0.100 0.050 0.080 0.043 0.036 0.057 0.154 0.133 0.193
  [13] 0.129 0.121 0.081 0.178 0.041 0.040 0.116 0.078 0.104 0.095 0.116 0.038
  [25] 0.141 0.100 0.104 0.078 0.121 0.104

Using hist, I made a vector of frequencies:
> N.hist <- hist(N.concentration, breaks=10)
> N.freq <- N.hist$counts
> N.freq
  [1] 3 5 4 4 5 4 2 2 1

I am using chisq.test to check whether N.freq fits a normal distribution. However, the function needs the argument p, "a vector of probabilities of the same length as x", as defined in the chisq.test documentation. I am trying to generate that vector, but honestly I do not know what it should contain. This is my attempt:

> d <- length(N.freq) %/% 2
> p <- dnorm(c(-d:d))
> p
  [1] 0.0001338302 0.0044318484 0.0539909665 0.2419707245 0.3989422804
  [6] 0.2419707245 0.0539909665 0.0044318484 0.0001338302
> chisq.test(N.freq, p = p)
   Error in chisq.test(N.freq, p = p) : 
   probabilities must sum to 1.

I considered using rescale.p=TRUE, but I am not sure whether that produces a valid test.


Edit: if I use rescale.p, I get a warning message

> chisq.test(N.freq, p=p, rescale.p=TRUE)

Chi-squared test for given probabilities

data:  N.freq
X-squared = 2697.7, df = 8, p-value < 2.2e-16

Warning message:
In chisq.test(N.freq, p = p, rescale.p = TRUE) :
Chi-squared approximation may be incorrect

As I said, to test for normality we need to know the mean and standard deviation of the normal distribution under the null hypothesis. Since no values are given, we have to estimate them from your 30 data points.

x <- c(0.164, 0.045, 0.069, 0.1, 0.05, 0.08, 0.043, 0.036, 0.057, 
0.154, 0.133, 0.193, 0.129, 0.121, 0.081, 0.178, 0.041, 0.04, 
0.116, 0.078, 0.104, 0.095, 0.116, 0.038, 0.141, 0.1, 0.104, 
0.078, 0.121, 0.104)

mu <- mean(x)
sig <- sd(x)

Now, as you did, we need to bin the data:

h <- hist(x, breaks = 10)
str(h)
#List of 6
# $ breaks  : num [1:10] 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2
# $ counts  : int [1:9] 3 5 4 4 5 4 2 2 1
# $ density : num [1:9] 5 8.33 6.67 6.67 8.33 ...
# $ mids    : num [1:9] 0.03 0.05 0.07 0.09 0.11 0.13 0.15 0.17 0.19
# $ xname   : chr "x"
# $ equidist: logi TRUE
# - attr(*, "class")= chr "histogram"

To get the true probabilities under the null hypothesis, we need the probability of each bin, that is, the probability between consecutive breaks.

p <- diff(pnorm(h$breaks, mu, sig))
#[1] 0.05675523 0.10254734 0.15053351 0.17953337 0.17396679 0.13696059 0.08760419
#[8] 0.04552387 0.01921839
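
These probabilities sum to slightly less than 1 because the normal distribution also puts some mass below the first break and above the last one; that is why rescale.p = TRUE is used below. As an alternative sketch (my addition, not part of the original answer), you can fold the tails into the two outermost bins by extending them to -Inf and Inf, so that p sums to exactly 1 and no rescaling is needed:

b <- h$breaks
b[1] <- -Inf                       # absorb the lower tail into the first bin
b[length(b)] <- Inf                # absorb the upper tail into the last bin
p.full <- diff(pnorm(b, mu, sig))  # bin probabilities now sum to 1
chisq.test(h$counts, p = p.full)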

I tend not to trust a chi-squared test with only 30 data points, but here is how we can use chisq.test:

chisq.test(h$counts, p = p, rescale.p = TRUE)
#
#   Chi-squared test for given probabilities
#
#data:  h$counts
#X-squared = 3.1476, df = 8, p-value = 0.9248
#
#Warning message:
#In chisq.test(h$counts, p = p, rescale.p = TRUE) :
#  Chi-squared approximation may be incorrect

Usually you do not need to worry about the warning message. If you want to get rid of it, set simulate.p.value = TRUE:

chisq.test(h$counts, p = p, rescale.p = TRUE, simulate.p.value = TRUE)
#
#   Chi-squared test for given probabilities with simulated p-value (based
#   on 2000 replicates)
#
#data:  h$counts
#X-squared = 3.1476, df = NA, p-value = 0.942
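
One more caveat (my note, not from the original answer): chisq.test does not know that mu and sig were estimated from the same data, so it reports df = 8. A commonly cited adjustment for a goodness-of-fit test with estimated parameters is to drop one degree of freedom per estimated parameter; a rough sketch of recomputing the p-value that way:

# Sketch: same statistic, but df = bins - 1 - 2 (one df per estimated parameter)
fit <- chisq.test(h$counts, p = p, rescale.p = TRUE)
pchisq(fit$statistic, df = length(h$counts) - 3, lower.tail = FALSE)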

Many statistical tests are designed to test a specified data set for deviations from normality (see, for example, the nortest package). However, you should be aware that many statisticians feel that normality testing is "essentially useless"; in particular (from the linked CrossValidated question):

The question scientists often expect the normality test to answer: Do the data deviate enough from the Gaussian ideal to "forbid" use of a test that assumes a Gaussian distribution? Scientists often want the normality test to be the referee that decides when to abandon conventional (ANOVA, etc.) tests and instead analyze transformed data or use a rank-based nonparametric test or a resampling or bootstrap approach. For this purpose, normality tests are not very useful.

Nevertheless, to go ahead with the Shapiro-Wilk test from base R (according to the Wikipedia page, Shapiro-Wilk has good power; but note from the discussion above that high power is not necessarily what we really want in this situation ...):

d <- c(0.164,0.045,0.069,0.100,0.050,0.080,0.043,0.036,0.057,0.154,
       0.133,0.193,0.129,0.121,0.081,0.178,0.041,0.040,0.116,0.078,
       0.104,0.095,0.116,0.038,0.141,0.100,0.104,0.078,0.121,0.104)
shapiro.test(d)
##  Shapiro-Wilk normality test
## 
## data:  d
## W = 0.9547, p-value = 0.2255
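
The nortest package mentioned above offers several alternative normality tests; a minimal sketch (assuming the package is installed), output not shown here:

library(nortest)
ad.test(d)      # Anderson-Darling normality test
lillie.test(d)  # Lilliefors (Kolmogorov-Smirnov) normality test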

A graphical approach:

par(las=1,bty="l")
qqnorm(d)
qqline(d)

The points fit the line reasonably well. The largest deviations (the three smallest points in the data set) are actually larger than expected, meaning the data set is slightly thin-tailed at the lower end, which in turn means tests based on a normality assumption will generally be slightly conservative.
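
Another graphical check worth sketching (my addition, not part of the original answer) is to overlay the fitted normal density on a histogram of the data:

hist(d, breaks = 10, freq = FALSE,
     main = "Histogram of d with fitted normal density")
curve(dnorm(x, mean = mean(d), sd = sd(d)), add = TRUE, lwd = 2)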