Chi-square test with grouped data in dplyr
I am having trouble summarising a data.frame like the one shown below:
db <- data.frame(ID = c(rep(1, 3), rep(2,4), rep(3, 2), 4),
Gender = factor(c(rep("woman", 7), rep("man", 2), "woman")),
Grade = c(rep(3, 3), rep(1, 4), rep(2, 2), 1),
Drug = c(1, 2, 2, 1, 2, 6, 9, 8, 5, 1),
Group = c(rep(1, 3), rep(2,4), rep(1, 2), 2))
db
# ID Gender Grade Drug Group
# 1 1 woman 3 1 1
# 2 1 woman 3 2 1
# 3 1 woman 3 2 1
# 4 2 woman 1 1 2
# 5 2 woman 1 2 2
# 6 2 woman 1 6 2
# 7 2 woman 1 9 2
# 8 3 man 2 8 1
# 9 3 man 2 5 1
# 10 4 woman 1 1 2
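Ideally I would have one row per observation, but because Drug changes over time I end up with many repeated rows. This makes the data hard to analyse.
My final goal is to build a summary table, as already discussed in another post. Something like this: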
| Variable | Group 1 | Group 2 | Difference Group 1/2 |
| Gender   |         |         | p = 1                |
| Male     | 1       | 0       |                      |
| Female   | 1       | 2       |                      |
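However, since that post was only partially answered and does not apply directly to my problem (mainly because of the repeated rows), I would already be happy if I could compute the summary statistics separately. In another post, I asked how to get unique/distinct frequencies from the observations. Now I need to find out whether the gender distribution differs significantly between the two groups.
Based on ID, I know there are four observations, of which three are women and one is a man. So the desired result can be computed like this: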
gen <- factor(c("woman", "woman", "man", "woman"))
gr <- c(1, 2 ,1 ,2)
chisq.test(gen, gr)
# Pearson's Chi-squared test with Yates' continuity correction
#
# data: gen and gr
# X-squared = 0, df = 1, p-value = 1
#
# Warning message:
# In chisq.test(gen, gr) : Chi-squared approximation may be incorrect
How can I calculate the p-value from my data.frame using dplyr?
My failed approach was:
db %>%
group_by(ID) %>%
distinct(ID, Gender, Group) %>%
summarise_all(funs(chisq.test(db$Gender,
db$Group)$p.value))
# A tibble: 4 x 3
# ID Gender Group
# <dbl> <dbl> <dbl>
# 1 1. 0.429 0.429
# 2 2. 0.429 0.429
# 3 3. 0.429 0.429
# 4 4. 0.429 0.429
# Warning messages:
# 1: In chisq.test(db$Gender, db$Group) :
# Chi-squared approximation may be incorrect
# 2: In chisq.test(db$Gender, db$Group) :
# Chi-squared approximation may be incorrect
# 3: In chisq.test(db$Gender, db$Group) :
# Chi-squared approximation may be incorrect
# 4: In chisq.test(db$Gender, db$Group) :
# Chi-squared approximation may be incorrect
# 5: In chisq.test(db$Gender, db$Group) :
# Chi-squared approximation may be incorrect
# 6: In chisq.test(db$Gender, db$Group) :
# Chi-squared approximation may be incorrect
# 7: In chisq.test(db$Gender, db$Group) :
# Chi-squared approximation may be incorrect
# 8: In chisq.test(db$Gender, db$Group) :
# Chi-squared approximation may be incorrect
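The failed attempt returns 0.429 in every cell because db$Gender and db$Group refer to the full 10-row data frame rather than the deduplicated data coming through the pipe, and summarise_all evaluates the call once per column in each ID group (4 groups x 2 columns, hence the eight warnings). Instead, we can ungroup after distinct and then use summarise with the bare column names to get the p-value: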
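library(dplyr)

db %>%
  group_by(ID) %>%
  distinct(ID, Gender, Group) %>%  # keep one row per ID
  ungroup() %>%                    # drop the grouping so the test sees all IDs at once
  summarise(pval = chisq.test(Gender, Group)$p.value)
# For the example data this gives pval = 1, with the same
# "Chi-squared approximation may be incorrect" warning as above.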
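As a rough check of the approach above, the following sketch rebuilds the example data, deduplicates it to one row per ID, and compares the p-value on the deduplicated rows with the p-value on all 10 rows. It also counts the unique IDs per group and gender, which are the cells of the desired summary table. The names per_id, counts, p_dedup and p_all are chosen here for illustration only, and the warnings are suppressed just to keep the output short.

```r
library(dplyr)

# Same example data as in the question
db <- data.frame(ID     = c(rep(1, 3), rep(2, 4), rep(3, 2), 4),
                 Gender = factor(c(rep("woman", 7), rep("man", 2), "woman")),
                 Grade  = c(rep(3, 3), rep(1, 4), rep(2, 2), 1),
                 Drug   = c(1, 2, 2, 1, 2, 6, 9, 8, 5, 1),
                 Group  = c(rep(1, 3), rep(2, 4), rep(1, 2), 2))

# One row per ID (illustrative name, not from the original post)
per_id <- db %>% distinct(ID, Gender, Group)

# Unique counts per group and gender, i.e. the cells of the summary table
counts <- per_id %>% count(Group, Gender)

# Test on the deduplicated rows; the counts are tiny, so R would warn
# that the chi-squared approximation may be incorrect
p_dedup <- suppressWarnings(chisq.test(per_id$Gender, per_id$Group)$p.value)

# Test on all 10 rows, which is what db$Gender / db$Group evaluated
# in the failed attempt
p_all <- suppressWarnings(chisq.test(db$Gender, db$Group)$p.value)

print(counts)
round(c(deduplicated = p_dedup, all_rows = p_all), 3)
# deduplicated = 1, all_rows = 0.429
```

The group_by(ID) step is left out here because distinct(ID, Gender, Group) already returns one row per ID when Gender and Group are constant within each ID, as they are in the example data.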