Different P values for chi square test in Python and R
I am trying to run a chi-square test on two categories of biological data. I have a data frame like this:
          Brain  Cerebelum  Heart  Kidney  liver  testis
expected      3         66      1      44     34      88
observed      6         57      4      45     35      69
structure(list(Brain = c(3L, 6L), Cerebelum = c(66L, 57L), heart = c(1L,
4L), kidney = 44:45, liver = 34:35, testis = c(88L, 69L)), .Names = c("Brain",
"Cerebelum", "heart", "kidney", "liver", "testis"), class = "data.frame", row.names = c("rand",
"cns"))
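I ran the test in Python:
from scipy.stats import chisquare
obs = [6, 57, 4, 45, 35, 69]  # the "observed" row above
exp = [3, 66, 1, 44, 34, 88]  # the "expected" row above
chisquare(obs, f_exp=exp)
The result is:
Power_divergenceResult(statistic=17.381684491978611, pvalue=0.0038300192430189722)
I tried to reproduce the result in R, so I made a CSV file, imported it into R as a data frame, and ran: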
d<-read.csv(file)
chisq.test(d)
Pearson's Chi-squared test
data: d
X-squared = 4.9083, df = 5, p-value = 0.4272
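Why do the chi-square and p values differ between Python and R? I calculated the statistic by hand with the simple (O-E)^2/E formula and got 17.38, the same value Python computes, but I can't figure out how R arrives at 4.90.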
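I can answer your first question.
chisq.test, when you give it a matrix with at least two rows and columns, treats it as a two-dimensional contingency table and tests for independence between the row and column classifications. Here's an example and another one.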
scipy.stats.chisquare, on the other hand, simply computes the familiar X = sum( (O_i-E_i)^2 / E_i ) from the definition of the test statistic.
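To make the contrast concrete, here is a minimal pure-Python sketch of the independence test that R runs when it is handed the whole two-row table. The variable names are my own, and it only reproduces the statistic and degrees of freedom, not the p-value. It should come out close to the X-squared = 4.9083, df = 5 that R printed in the question.

```python
# Rows follow the question's table: "expected" first, "observed" second.
table = [
    [3, 66, 1, 44, 34, 88],
    [6, 57, 4, 45, 35, 69],
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Under independence, each expected cell is row total * column total / N.
statistic = 0.0
for i, row in enumerate(table):
    for j, count in enumerate(row):
        expected_cell = row_totals[i] * col_totals[j] / grand_total
        statistic += (count - expected_cell) ** 2 / expected_cell

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof = (len(table) - 1) * (len(table[0]) - 1)

print(round(statistic, 4), dof)
```

Note that in this reading the "expected" row is not used as an expectation at all; both rows are treated as observed counts.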
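So how do you square the circle? First, pass R the observed values and define the expected probabilities in the argument p. Second, tell R not to apply a continuity correction; going by the help text quoted below, that correction only applies to 2 by 2 tables, so correct = FALSE is a precaution here rather than a necessity.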
e <- d[1, ]
o <- d[2, ]
chisq.test(o, p = e / sum(e), correct = FALSE)
Voilà:
Chi-squared test for given probabilities
data: o
X-squared = 17.139, df = 5, p-value = 0.004243
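R's 17.139 is still not quite Python's 17.38. One way to account for the gap: the observed counts sum to 216 while the expected counts sum to 236, and passing p = e / sum(e) makes R rescale the expected counts to the observed total. The sketch below (plain Python, helper name is my own) computes the statistic both ways so the two numbers can be compared side by side.

```python
observed = [6, 57, 4, 45, 35, 69]
expected = [3, 66, 1, 44, 34, 88]


def pearson_statistic(obs, exp):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


# Raw expected counts, as in the hand calculation from the question.
stat_raw = pearson_statistic(observed, expected)

# Expected counts rescaled to the observed total, mirroring p = e / sum(e).
total_obs = sum(observed)
total_exp = sum(expected)
rescaled = [total_obs * e / total_exp for e in expected]
stat_rescaled = pearson_statistic(observed, rescaled)

print(round(stat_raw, 4), round(stat_rescaled, 3))
```

The first value should match the 17.38 from the question and the second the 17.139 from R. Which of the two is appropriate depends on whether the expected row is meant as counts or as proportions.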
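PS: A tricky question for SO; might it be a better fit for Cross Validated?
Note that R's default correction, in contrast to scipy, may well be a good thing. Whether that is actually true is definitely one for Cross Validated.
PPS: The help in ?chisq.test is a bit hard to parse, but I think it's all in there somewhere ;)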
If ‘x’ is a matrix with one row or column, or if ‘x’ is a vector
and ‘y’ is not given, then a _goodness-of-fit test_ is performed
(‘x’ is treated as a one-dimensional contingency table). The
entries of ‘x’ must be non-negative integers. In this case, the
hypothesis tested is whether the population probabilities equal
those in ‘p’, or are all equal if ‘p’ is not given.
If ‘x’ is a matrix with at least two rows and columns, it is taken
as a two-dimensional contingency table: the entries of ‘x’ must be
non-negative integers. Otherwise, ‘x’ and ‘y’ must be vectors or
factors of the same length; cases with missing values are removed,
the objects are coerced to factors, and the contingency table is
computed from these. Then Pearson's chi-squared test is performed
of the null hypothesis that the joint distribution of the cell
counts in a 2-dimensional contingency table is the product of the
row and column marginals.
and
correct: a logical indicating whether to apply continuity correction
when computing the test statistic for 2 by 2 tables: one half
is subtracted from all |O - E| differences; however, the
correction will not be bigger than the differences
themselves. No correction is done if ‘simulate.p.value =
TRUE’.
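For illustration, the continuity correction described in that help text can be sketched in plain Python as follows. The 2 by 2 table is made up for the example and the function name is hypothetical; this is my reading of the quoted description, not R's actual source.

```python
def chi_square_2x2(table, correct=True):
    """Pearson statistic for a 2 by 2 table, optionally continuity-corrected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Expected counts under independence, flattened row by row.
    expected = [r * c / n for r in row_totals for c in col_totals]
    observed = [count for row in table for count in row]
    diffs = [abs(o - e) for o, e in zip(observed, expected)]

    # Subtract one half from every |O - E|, but never more than the
    # differences themselves.
    adjustment = min(0.5, min(diffs)) if correct else 0.0
    return sum((d - adjustment) ** 2 / e for d, e in zip(diffs, expected))


# Made-up counts, purely for illustration.
hypothetical = [[10, 20], [30, 40]]
uncorrected = chi_square_2x2(hypothetical, correct=False)
corrected = chi_square_2x2(hypothetical, correct=True)
print(uncorrected, corrected)
```

The corrected statistic is never larger than the uncorrected one, which is why the correction makes the test more conservative. The question's table is 2 by 6, so this correction does not come into play there.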