Running multiple chi-squared tests for different categories
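I have binary data on whether individuals passed or failed a test, along with characteristic information (e.g. gender) and the department each person belongs to (e.g. x, y, z), in a data frame (data):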
head(data,9)
department gender pass
x Male 1
y Female 1
y Male 0
y Male 1
x Female 1
z Female 0
z Male 1
x Male 0
z Female 0
I can easily run a chi-squared test on the relationship between gender and passing:
chisq.test(data$gender, data$pass)
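But is there a way to run this separately for each value of 'department' (x, y, z) without having to subset the data manually each time?
I can use tapply to create a new data frame that breaks down the overall pass rate by department: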
as.data.frame(tapply(data$pass, data$department,mean))
But is there a way to add a new variable that holds the result of the test above (say, the p-value)?
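Using broom and dplyr is one elegant approach. First, we group by the department variable and nest our data frame. Then we run chisq.test on each "subset". Finally, to get the relevant statistics (e.g. p.value), we use broom::tidy. Since these results are all nested within each subset, we unnest whichever components we ultimately want to see.
For more details, see this vignette.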
library(tidyverse)
library(broom)
df <- data.frame(
stringsAsFactors = FALSE,
department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
gender = c("Male","Female","Male",
"Male","Female","Female","Male","Male","Female"),
pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L)
)
df %>%
group_by(department) %>%
nest() %>%
mutate(
chi_test = map(data, ~ chisq.test(.$gender, .$pass)),
tidied = map(chi_test, tidy)
) %>%
unnest(tidied)
#> # A tibble: 3 x 7
#> # Groups: department [3]
#> department data chi_test statistic p.value parameter method
#> <chr> <list> <list> <dbl> <dbl> <int> <chr>
#> 1 x <tibble ~ <htest> 4.62e-32 1.00 1 Pearson's Chi-squar~
#> 2 y <tibble ~ <htest> 4.62e-32 1.00 1 Pearson's Chi-squar~
#> 3 z <tibble ~ <htest> 1.88e- 1 0.665 1 Pearson's Chi-squar~
Created on 2020-05-20 by the reprex package (v0.3.0)
If you want to use base R, you can use split and lapply, something like this:
lapply(split(df, df$department), function(x) { chisq.test(x$gender, x$pass)$p.value })
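The list returned by the lapply call above can also be collected into a data frame that pairs each department's pass rate with its p-value, which is what the question asks for. A minimal sketch, assuming the same nine-row df as in the previous answer; the names groups and result are hypothetical, and suppressWarnings is there only because such a small sample triggers the "Chi-squared approximation may be incorrect" warning:

```r
# Same example data as in the answer above
df <- data.frame(
  department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
  gender = c("Male", "Female", "Male", "Male", "Female",
             "Female", "Male", "Male", "Female"),
  pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# One data frame per department
groups <- split(df, df$department)

# One row per department: pass rate plus the chi-squared p-value
result <- data.frame(
  department = names(groups),
  pass_rate = vapply(groups, function(x) mean(x$pass),
                     numeric(1), USE.NAMES = FALSE),
  p_value = vapply(groups, function(x) {
    suppressWarnings(chisq.test(x$gender, x$pass)$p.value)
  }, numeric(1), USE.NAMES = FALSE)
)
result
```

vapply is used instead of lapply so that each column is a plain numeric vector rather than a list.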
Yes! Use by.
res <- do.call(rbind, by(dat, dat$department, function(x) {
  # M: pass rate within this department; p: p-value for gender vs. pass
  c(M = mean(x$pass),
    p = chisq.test(x$gender, x$pass)$p.value)
}))
res
# M p
# x 0.6788732 1.484695e-18
# y 0.6516517 3.045009e-22
# z 0.3205128 7.945768e-69
Data:
dat <- read.table(text="department gender pass
x Male 1
y Female 1
y Male 0
y Male 1
x Female 1
z Female 0
z Male 1
x Male 0
z Female 0", header=TRUE)
set.seed(42)
dat <- dat[sample(1:nrow(dat), 1000, replace=TRUE), ]
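This is not so much a different answer as an answer to a slightly different question. @JasonAizkalns provided an elegant answer for testing within each department, but if you are interested in comparing the departments with one another, you need to adjust for multiple comparisons. It might look something like this.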
library(dplyr)
library(rcompanion)
df <- data.frame(
stringsAsFactors = FALSE,
department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
gender = c("Male","Female","Male",
"Male","Female","Female","Male","Male","Female"),
pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L)
)
df %>%
group_by(department, gender) %>%
summarise(Freq = n()) %>%
xtabs(formula = Freq ~ ., data = .) %>%
pairwiseNominalIndependence(x = ., method = "holm", gtest = FALSE)
#> Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
#> Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
#> Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
#> Comparison p.Fisher p.adj.Fisher p.Chisq p.adj.Chisq
#> 1 x : y 1 1 1 1
#> 2 x : z 1 1 1 1
#> 3 y : z 1 1 1 1
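A related but separate case: if the per-department p-values from the earlier answers are interpreted together as one family of tests, the same Holm correction is available in base R through p.adjust. This is only a sketch, not part of the rcompanion approach above; it assumes the same nine-row df, and p_raw and p_holm are hypothetical names:

```r
# Same example data as above
df <- data.frame(
  department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
  gender = c("Male", "Female", "Male", "Male", "Female",
             "Female", "Male", "Male", "Female"),
  pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Unadjusted p-value of gender vs. pass within each department
# (warnings suppressed because the example sample is tiny)
p_raw <- vapply(split(df, df$department), function(x) {
  suppressWarnings(chisq.test(x$gender, x$pass)$p.value)
}, numeric(1))

# Holm adjustment across the three per-department tests
p_holm <- p.adjust(p_raw, method = "holm")

data.frame(department = names(p_raw),
           p_raw = unname(p_raw),
           p_holm = unname(p_holm))
```

Holm-adjusted p-values are never smaller than the raw ones, so any department that stays below the chosen threshold after adjustment does so with the family-wise error rate controlled.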