A subtle problem with t-tests over multiple columns
I have a data frame that holds answers to multiple questions (a reproducible example with 2 questions is below):
library(dplyr)
library(tidyr)

set.seed(1)
df <- data.frame(
  UserId = c(rep("A", 4), rep("B", 4), rep("C", 4), rep("D", 4)),
  Sex = c(rep("Female", 8), rep("Male", 4), rep("No_Response", 4)),
  Answer_Date = as.Date(c("1990-01-01", "1990-02-01", "1990-03-01", "1990-04-01",
                          "1991-02-01", "1991-03-01", "1991-04-01", "1991-05-01",
                          "1992-03-01", "1992-04-01", "1992-05-01", "1992-06-01",
                          "1993-07-10", "1992-08-10", "1993-09-10", "1993-10-10")),
  Q1 = sample(1:10, 16, replace = TRUE),
  Q2 = sample(1:10, 16, replace = TRUE)
) %>%
  group_by(UserId) %>%
  mutate(First_Answer_Date = min(Answer_Date)) %>%
  mutate(Last_Answer_Date = max(Answer_Date)) %>%
  ungroup()
Following the advice at
https://sebastiansauer.github.io/multiple-t-tests-with-dplyr/
I run one-sample t-tests for Q1 and Q2 against the null hypothesis that the true mean is 0:
questions <- c("Q1", "Q2")
df %>%
  select(all_of(questions), Sex) %>%
  filter(Sex != "No_Response") %>%
  gather(key = variable, value = value, -Sex) %>%
  group_by(Sex, variable) %>%
  summarize(value = list(value)) %>%
  spread(Sex, value) %>%
  group_by(variable) %>%
  mutate(p_Female = t.test(unlist(Female))$p.value,
         p_Male   = t.test(unlist(Male))$p.value,
         t_Female = t.test(unlist(Female))$statistic,
         t_Male   = t.test(unlist(Male))$statistic) %>%
  mutate(Female = length(unlist(Female)),
         Male   = length(unlist(Male)))
This gives me
# A tibble: 2 x 7
# Groups:   variable [2]
  variable Female  Male  p_Female  p_Male t_Female t_Male
  <chr>     <int> <int>     <dbl>   <dbl>    <dbl>  <dbl>
1 Q1            8     4 0.0000501 0.00137     8.78  11.6
2 Q2            8     4 0.00217   0.0115      4.71   5.55
So far, so good. My trouble begins when I want to run the t-test only on the rows where Answer_Date equals First_Answer_Date:
df %>%
  filter(Answer_Date == First_Answer_Date) %>%
  select(all_of(questions), Sex) %>%
  filter(Sex != "No_Response")

# A tibble: 3 x 3
     Q1    Q2 Sex
  <int> <int> <chr>
1     9     5 Female
2     2     5 Female
3     1     9 Male
Now there is only one response from a male and two from females, and in Q2 both female respondents gave the same answer. If I rerun my t-test code
df %>%
  filter(Answer_Date == First_Answer_Date) %>%
  select(all_of(questions), Sex) %>%
  filter(Sex != "No_Response") %>%
  gather(key = variable, value = value, -Sex) %>%
  group_by(Sex, variable) %>%
  summarize(value = list(value)) %>%
  spread(Sex, value) %>%
  group_by(variable) %>%
  mutate(p_Female = t.test(unlist(Female))$p.value,
         p_Male   = t.test(unlist(Male))$p.value,
         t_Female = t.test(unlist(Female))$statistic,
         t_Male   = t.test(unlist(Male))$statistic) %>%
  mutate(Female = length(unlist(Female)),
         Male   = length(unlist(Male)))

Error: Problem with `mutate()` input `p_Female`.
x data are essentially constant
i Input `p_Female` is `t.test(unlist(Female))$p.value`.
i The error occurred in group 2: variable = "Q2".
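For reference, the failure can be reproduced outside the pipeline. The following is a minimal sketch in base R that feeds `t.test()` the filtered values shown above; the helper name `why_fails` is made up for illustration and is not part of the original code. It returns the error message raised by `t.test()`, or NA when the test runs.

```r
# Hypothetical helper (illustration only): return the message of the error
# that t.test() raises for x, or NA_character_ if the test succeeds.
why_fails <- function(x) {
  tryCatch({
    t.test(x)
    NA_character_
  },
  error = function(e) conditionMessage(e))
}

why_fails(c(5, 5))  # Q2, Female: two identical answers, so the variance is zero
why_fails(1)        # Q1, Male: a single answer, too few observations
why_fails(c(9, 2))  # Q1, Female: two distinct answers, so the test runs
```

The first call hits the "data are essentially constant" condition reported in the error above, and the second fails because a one-sample t-test needs at least two observations.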
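The error message I get is logical, but this is a situation I am likely to run into in practice: some subsets may have size 1 or 0, all respondents to some question may give the same answer, and so on. How can I make the code degrade gracefully, so that it simply puts a blank or NA in those cells of the output tibble where the answer cannot be computed for one reason or another?
Sincerely,
Thomas Philips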
Perhaps you can use tryCatch to handle the errors:
library(dplyr)
library(tidyr)

df %>%
  filter(Answer_Date == First_Answer_Date) %>%
  select(all_of(questions), Sex) %>%
  filter(Sex != "No_Response") %>%
  pivot_longer(cols = -Sex, names_to = "variable") %>%
  group_by(Sex, variable) %>%
  summarize(value = list(value), .groups = "drop") %>%
  pivot_wider(names_from = Sex, values_from = value) %>%
  group_by(variable) %>%
  # tryCatch() returns NA whenever t.test() raises an error
  mutate(p_Female = tryCatch(t.test(unlist(Female))$p.value,   error = function(e) NA_real_),
         p_Male   = tryCatch(t.test(unlist(Male))$p.value,     error = function(e) NA_real_),
         t_Female = tryCatch(t.test(unlist(Female))$statistic, error = function(e) NA_real_),
         t_Male   = tryCatch(t.test(unlist(Male))$statistic,   error = function(e) NA_real_)) %>%
  ungroup() %>%
  mutate(Female = lengths(Female),
         Male   = lengths(Male))
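If the repeated tryCatch calls become unwieldy, the same idea can be wrapped in a small helper that runs the test once and returns both numbers. This is only a sketch in base R: the name `safe_t` and the choice to return a named vector are assumptions for illustration, not part of the answer above.

```r
# Hypothetical helper (illustration only): run a one-sample t-test once and
# return its p-value and t statistic, or NA for both if t.test() fails.
safe_t <- function(x) {
  tryCatch({
    res <- t.test(x)
    c(p = res$p.value, t = unname(res$statistic))
  },
  error = function(e) c(p = NA_real_, t = NA_real_))
}

safe_t(c(9, 2))     # two distinct values: both numbers are computed
safe_t(c(5, 5))     # constant data: both NA
safe_t(1)           # a single observation: both NA
safe_t(numeric(0))  # an empty subset: both NA
```

Inside the pipeline, one could then write, for example, `p_Female = safe_t(unlist(Female))[["p"]]` and `t_Female = safe_t(unlist(Female))[["t"]]`. Catching the error keeps the code short, at the cost of also hiding any unexpected failure behind an NA.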