R: How to generate a vector of normally distributed probabilities to be used in chisq.test
I have a vector of 30 samples and I want to test the hypothesis that the samples come from a normally distributed population.
> N.concentration
[1] 0.164 0.045 0.069 0.100 0.050 0.080 0.043 0.036 0.057 0.154 0.133 0.193
[13] 0.129 0.121 0.081 0.178 0.041 0.040 0.116 0.078 0.104 0.095 0.116 0.038
[25] 0.141 0.100 0.104 0.078 0.121 0.104
I made a frequency vector using hist:
> N.hist <- hist(N.concentration, breaks=10)
> N.freq <- N.hist$count
[1] 3 5 4 4 5 4 2 2 1
I'm using chisq.test to check whether N.freq fits a normal distribution. However, the function expects an argument p, "a vector of probabilities of the same length as x", as defined in the chisq.test documentation. I'm trying to generate that vector, but honestly I don't know what it should contain. I tried:
> d <- length(N.freq) %/% 2
> p <- dnorm(c(-d:d))
> p
[1] 0.0001338302 0.0044318484 0.0539909665 0.2419707245 0.3989422804
[6] 0.2419707245 0.0539909665 0.0044318484 0.0001338302
> chisq.test(N.freq, p = p)
Error in chisq.test(N.freq, p = p) :
probabilities must sum to 1.
I considered using rescale.p=TRUE, but I'm not sure that would produce a valid test.
Edit: if I use rescale.p, I get a warning message:
> chisq.test(N.freq, p=p, rescale.p=TRUE)
Chi-squared test for given probabilities
data: N.freq
X-squared = 2697.7, df = 8, p-value < 2.2e-16
Warning message:
In chisq.test(N.freq, p = p, rescale.p = TRUE) :
Chi-squared approximation may be incorrect
As I said, to test for normality we must know the mean and standard deviation of the normal distribution under the null hypothesis. Since no values are given, they have to be estimated from your 30 data points.
x <- c(0.164, 0.045, 0.069, 0.1, 0.05, 0.08, 0.043, 0.036, 0.057,
0.154, 0.133, 0.193, 0.129, 0.121, 0.081, 0.178, 0.041, 0.04,
0.116, 0.078, 0.104, 0.095, 0.116, 0.038, 0.141, 0.1, 0.104,
0.078, 0.121, 0.104)
mu <- mean(x)
sig <- sd(x)
Now, as you did, we need to bin the data (the comments below show str(h)):
h <- hist(x, breaks = 10)
#List of 6
# $ breaks : num [1:10] 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2
# $ counts : int [1:9] 3 5 4 4 5 4 2 2 1
# $ density : num [1:9] 5 8.33 6.67 6.67 8.33 ...
# $ mids : num [1:9] 0.03 0.05 0.07 0.09 0.11 0.13 0.15 0.17 0.19
# $ xname : chr "x"
# $ equidist: logi TRUE
# - attr(*, "class")= chr "histogram"
To get the true cell probabilities under the null hypothesis, we need the probability mass of each bin, i.e. the probability between consecutive breaks:
p <- diff(pnorm(h$breaks, mu, sig))
#[1] 0.05675523 0.10254734 0.15053351 0.17953337 0.17396679 0.13696059 0.08760419
#[8] 0.04552387 0.01921839
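Note that these probabilities sum to less than 1, because the fitted normal has mass outside the outermost breaks. As an alternative to rescale.p = TRUE (my own sketch, not part of the original answer), you can extend the first and last bins to -Inf and Inf so the cell probabilities cover the whole real line:

```r
x <- c(0.164, 0.045, 0.069, 0.1, 0.05, 0.08, 0.043, 0.036, 0.057,
       0.154, 0.133, 0.193, 0.129, 0.121, 0.081, 0.178, 0.041, 0.04,
       0.116, 0.078, 0.104, 0.095, 0.116, 0.038, 0.141, 0.1, 0.104,
       0.078, 0.121, 0.104)
mu  <- mean(x)
sig <- sd(x)
h   <- hist(x, breaks = 10, plot = FALSE)

## absorb both tails into the outermost bins
br <- h$breaks
br[c(1, length(br))] <- c(-Inf, Inf)
p.full <- diff(pnorm(br, mu, sig))
sum(p.full)   # exactly 1, so rescale.p is no longer needed
```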
I tend not to trust a chi-squared test on only 30 data points, but here is how we can use chisq.test:
chisq.test(h$counts, p = p, rescale.p = TRUE)
#
# Chi-squared test for given probabilities
#
#data: h$counts
#X-squared = 3.1476, df = 8, p-value = 0.9248
#
#Warning message:
#In chisq.test(h$counts, p, rescale.p = TRUE) :
# Chi-squared approximation may be incorrect
Usually you don't need to worry about that warning message. If you want to get rid of it, set simulate.p.value = TRUE:
chisq.test(h$counts, p = p, rescale.p = TRUE, simulate.p.value = TRUE)
#
# Chi-squared test for given probabilities with simulated p-value (based
# on 2000 replicates)
#
#data: h$counts
#X-squared = 3.1476, df = NA, p-value = 0.942
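One caveat worth adding (my addition, not from the original answer): because mu and sig were estimated from the same data, the usual rule of thumb for a chi-squared goodness-of-fit test reduces the degrees of freedom by one for each estimated parameter, i.e. df = bins - 1 - 2 rather than bins - 1. A sketch of the corrected p-value, computed by hand:

```r
x <- c(0.164, 0.045, 0.069, 0.1, 0.05, 0.08, 0.043, 0.036, 0.057,
       0.154, 0.133, 0.193, 0.129, 0.121, 0.081, 0.178, 0.041, 0.04,
       0.116, 0.078, 0.104, 0.095, 0.116, 0.038, 0.141, 0.1, 0.104,
       0.078, 0.121, 0.104)
mu  <- mean(x)
sig <- sd(x)
h   <- hist(x, breaks = 10, plot = FALSE)
p   <- diff(pnorm(h$breaks, mu, sig))

obs  <- h$counts
expd <- sum(obs) * p / sum(p)        # expected counts after rescaling p
stat <- sum((obs - expd)^2 / expd)   # same X-squared that chisq.test reports
pchisq(stat, df = length(obs) - 1 - 2, lower.tail = FALSE)
```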
There are many statistical tests designed to test a specified data set for deviations from normality (see e.g. the nortest package). However, you should be aware that many statisticians feel that normality testing is "essentially useless". In particular (from the linked CrossValidated question):
The question scientists often expect the normality test to answer: Do the data deviate enough from the Gaussian ideal to "forbid" use of a test that assumes a Gaussian distribution? Scientists often want the normality test to be the referee that decides when to abandon conventional (ANOVA, etc.) tests and instead analyze transformed data or use a rank-based nonparametric test or a resampling or bootstrap approach. For this purpose, normality tests are not very useful.
Nevertheless, to proceed, use the Shapiro-Wilk test from base R (according to the Wikipedia page, Shapiro-Wilk has good power; but note from the discussion above that high power is not necessarily what we really want in this context ...)
d <- c(0.164,0.045,0.069,0.100,0.050,0.080,0.043,0.036,0.057,0.154,
0.133,0.193,0.129,0.121,0.081,0.178,0.041,0.040,0.116,0.078,
0.104,0.095,0.116,0.038,0.141,0.100,0.104,0.078,0.121,0.104)
shapiro.test(d)
## Shapiro-Wilk normality test
##
## data: d
## W = 0.9547, p-value = 0.2255
A graphical approach:
par(las=1,bty="l")
qqnorm(d)
qqline(d)
The points fit the line reasonably well. The largest deviations (the three smallest points in the data set) are actually larger than expected, meaning the data set is slightly thin-tailed at the lower end; tests based on the normality assumption will therefore generally be slightly conservative.