R: t-statistics with subsets
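I want a table as output that shows, for certain variables, the difference in means and the t-statistic between two specific subsets of my data.
I have the following data: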
dat <- structure(list(Name = c("A", "A", "A", "A", "B", "B", "B", "B",
"C", "C", "C", "C", "D", "D", "D", "D"), Date = c("20.10.2018",
"30.09.2018", "25.11.2019", "23.10.2020", "20.03.2018", "30.07.2018",
"25.08.2019", "23.10.2020", "20.12.2018", "30.01.2018", "25.02.2019",
"23.06.2020", "20.11.2018", "30.12.2018", "25.11.2019", "23.09.2020"
), Return = c(0.01, 0.05, 0.08, 0.07, 0.04, 0.03, 0.01, 0.03,
0.03, 0.05, 0.06, 0.07, 0.07, 0.04, 0.06, 0.08), Age = c(5L,
5L, 6L, 7L, 8L, 8L, 9L, 10L, 4L, 4L, 5L, 6L, 1L, 1L, 2L, 3L),
Size = c(53336L, 75768L, 86548L, 94567L, 40234L, 40240L,
50243L, 60352L, 5069L, 6069L, 7092L, 8024L, 2456L, 3046L,
4056L, 5600L), Rating = c(1L, 1L, 1L, 2L, 5L, 5L, 3L, NA,
4L, 5L, 4L, 5L, NA, 4L, 5L, 4L)), class = "data.frame", row.names = c(NA,
-16L))
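More specifically, I want a table with the t-statistics for the difference in means of the variables Return, Age and Size between observations with Rating 1 and observations with Rating 5. The t-statistic should be in a column between the Rating 1 and Rating 5 columns, and should include stars indicating the p-value.
I tried using the t.test function, but I struggle to apply it only to the subgroups and to create the t-statistics column between Rating 1 and Rating 5.
The output should have this layout: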
structure(list(c("Return", "Age", "Size"), `Mean Rating 1` = c(NA,
NA, NA), `t-statistics including p-value (indicated as stars)` = c(NA,
NA, NA), `Mean Rating 5` = c(NA, NA, NA)), class = "data.frame", row.names = c(NA,
-3L))
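Could someone help me with the code?
Thank you very much.
Edit 22.04.2022:
Question 1:
How do I need to adjust the code from the answer if I want the output to look as follows (no values yet, just to illustrate the layout I want):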
structure(list(c("Return", "Age", "Size"), `Mean Rating 1` = c(NA,
NA, NA), `Mean Rating2` = c(NA, NA, NA), `Mean Rating 3` = c(NA,
NA, NA), `Mean Rating 4` = c(NA, NA, NA), `Mean Rating 5` = c(NA,
NA, NA), `Mean Rating NA` = c(NA, NA, NA), `Difference in means Rating 5 and Rating 1` = c(NA,
NA, NA), `p-value for differences in means Rating 5 and Rating 1` = c(NA,
NA, NA), `stars for p-value for differences in means Rating 5 and Rating 1` = c(NA,
NA, NA)), class = "data.frame", row.names = c(NA, -3L))
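Question 2:
When I want to compare the difference in means between two groups, is it better to use a t-test or an F-test? I chose the t-test because, as far as I know, the t-test is the right test for comparing the means of two groups, whereas the F-test is preferable for comparing the standard deviations of two groups. Is my understanding correct?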
You can easily loop over the subsets. Note that the default method of t.test has no subset= argument (only the formula method does), so index the vector directly:
t(with(mtcars, sapply(unique(cyl), \(i) t.test(am[cyl == i]))))
# Returns a 3 x 10 list-matrix: one row per cylinder group, in the order of
# unique(cyl) (6, 4, 8), with the columns statistic, parameter, p.value,
# conf.int, estimate, null.value, stderr, alternative, method, data.name.
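If you would rather keep subset=, the formula method of t.test does accept it. The following is only a sketch on the built-in mtcars data; the choice of mpg ~ am and the names cyls and res are illustrative assumptions, not part of the original answer.

```r
# Two-sample Welch t-test of mpg by transmission type (am),
# run separately within each cylinder group via subset=.
cyls <- sort(unique(mtcars$cyl))
res <- t(sapply(setNames(cyls, paste0("cyl", cyls)), \(i) {
  tt <- t.test(mpg ~ am, data = mtcars, subset = cyl == i)
  # keep the two group means, the t statistic and the p-value
  c(tt$estimate, tt$statistic, p.value = tt$p.value)
}))
res
```

Each row of res then holds the result for one cylinder group, so the subsets are actually tested separately.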
For your specific data, you could do this:
tcols <- c('Return', 'Age', 'Size')
r <- t(with(subset(dat, Rating %in% c(1, 5)),
sapply(setNames(tcols, tcols), \(i) unlist(
t.test(reformulate('Rating', i))[
c('estimate', 'statistic', 'p.value')]
))))
cbind(as.data.frame(r),
' '=c(" ", "* ", "** ", "***")[
rowSums(outer(r[, 'p.value'], c(Inf, 0.05, 0.01, 0.001), `<`))])
# estimate.mean in group 1 estimate.mean in group 5 statistic.t p.value
# Return 4.666667e-02 0.05 -0.1552301 0.8883096
# Age 5.333333e+00 5.60 -0.2198599 0.8353634
# Size 7.188400e+04 19724.60 4.0457818 0.0109848 *
Note: this requires R >= 4.1 (for the \(i) anonymous-function shorthand).
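The star lookup above is compact but a little cryptic. A more explicit way to get the same mapping could look like the sketch below; p_stars is a hypothetical helper name, and the cut-offs are the same 0.05 / 0.01 / 0.001 thresholds used in the answer.

```r
# Hypothetical helper (not from the original answer): map p-values to stars.
# right = FALSE makes the intervals [lower, upper), so a star is only given
# when p is strictly below the threshold, as with `<` in the answer's code.
p_stars <- function(p) {
  as.character(cut(p,
                   breaks = c(-Inf, 0.001, 0.01, 0.05, Inf),
                   labels = c("***", "**", "*", ""),
                   right  = FALSE))
}

# p-values for Return, Age and Size from the table above
p_stars(c(0.8883096, 0.8353634, 0.0109848))
# [1] ""  ""  "*"
```

Unlike the lookup in the answer, this sketch returns an empty string rather than padded blanks for non-significant values.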
Edit
as.data.frame(t(with(subset(dat, Rating %in% c(1, 5)),
sapply(setNames(tcols, tcols), \(i) unlist(
t.test(reformulate('Rating', i))[
c('estimate', 'statistic', 'p.value')]
))))) |>
{\(.) cbind(mean.diff.5.1=apply(.[1:2], 1, diff), .[3:4])}() |>
cbind(' '=c(" ", "* ", "** ", "***")[
rowSums(outer(r[, 'p.value'], c(Inf, 0.05, 0.01, 0.001), `<`))],
`colnames<-`(t(aggregate(. ~ Rating, dat[3:6], mean)[-1]),
paste0('mean.rating.', 1:5))) |>
{\(.) .[c(5:9, 1:4)]}()
# mean.rating.1 mean.rating.2 mean.rating.3 mean.rating.4 mean.rating.5 mean.diff.5.1 statistic.t p.value
# Return 4.666667e-02 0.07 0.01 0.0525 0.05 3.333333e-03 -0.1552301 0.8883096
# Age 5.333333e+00 7.00 9.00 3.2500 5.60 2.666667e-01 -0.2198599 0.8353634
# Size 7.188400e+04 94567.00 50243.00 5201.7500 19724.60 -5.215940e+04 4.0457818 0.0109848 *
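The layout requested in the question also has a Mean Rating NA column, which the aggregate() call does not produce because its formula interface drops rows with a missing Rating by default. One possible way to add it is sketched below, assuming the numeric columns of the question's data; the names grp and means are illustrative only.

```r
# Numeric columns and Rating copied from the question's data.
dat <- data.frame(
  Return = c(0.01, 0.05, 0.08, 0.07, 0.04, 0.03, 0.01, 0.03,
             0.03, 0.05, 0.06, 0.07, 0.07, 0.04, 0.06, 0.08),
  Age    = c(5, 5, 6, 7, 8, 8, 9, 10, 4, 4, 5, 6, 1, 1, 2, 3),
  Size   = c(53336, 75768, 86548, 94567, 40234, 40240, 50243, 60352,
             5069, 6069, 7092, 8024, 2456, 3046, 4056, 5600),
  Rating = c(1, 1, 1, 2, 5, 5, 3, NA, 4, 5, 4, 5, NA, 4, 5, 4)
)
tcols <- c("Return", "Age", "Size")

# addNA() makes the missing ratings a factor level of their own,
# so tapply() computes a mean for that group as well.
grp <- addNA(factor(dat$Rating))
means <- t(vapply(dat[tcols],
                  \(x) as.vector(tapply(x, grp, mean)),
                  numeric(nlevels(grp))))
colnames(means) <- paste0("mean.rating.", c(1:5, "NA"))
means
```

The resulting matrix has one row per variable and could be bound to the test columns with cbind() in place of the aggregate() result.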