使用 lapply 创建 t 检验 table
using lapply to create t-test table
我想在两个人群(在治疗组内或治疗组外(下面的样本数据中分别为 1 或 0))的多个变量之间进行 t 检验,并且对于不同的研究,所有这些都坐着在同一个数据框中。在下面的示例数据中,我想为 1/0 治疗组之间的所有变量(在示例数据中:Age、Dollars、DiseaseCnt)生成 t 检验。我想 运行 这些 t 检验,按程序,而不是整个人群。我有生成 t 测试的逻辑。但是,我需要帮助完成最后一步,即从函数中提取适当的部分并创建一些容易消化的东西table。
最终,我想要的是:table 的 t-stats、p-值、执行 t-test 的变量以及测试变量的程序。
DT<-data.frame(
Treated=sample(0:1,1000,replace=T)
,Program=c('Program A','Program B','Program C','Program D')
,Age=as.integer(rnorm(1000,mean=65,sd=15))
,Dollars=as.integer(rpois(1000,lambda=1000))
,DiseaseCnt=as.integer(rnorm(1000,mean=5,sd=2)) )
progs<-unique(DT$Program) # Pull program names
vars<-names(DT)[3:5] # pull variables to run t tests
test<-lapply(progs, function(i)
tt<-lapply(vars, function(j) {t.test( DT[DT$Treated==1 & DT$Program == i,names(DT)==j]
,DT[DT$Treated==0 & DT$Program == i,names(DT)==j]
,alternative = 'two.sided' )
list(j,tt$statistic,tt$p.value) }
) )
# nested lapply produces results in list format that can be binded, but complete output w/ both lapply's is erroneous
您可以从回归中得到相同的 t 检验;如果您认为不同程序的治疗效果不同,则应包括交互作用。您还可以指定多个响应。
> m <- lm(cbind(Age,Dollars,DiseaseCnt)~Treated * Program - Treated - 1, DT)
> lapply(summary(m), `[[`, "coefficients")
$`Response Age`
Estimate Std. Error t value Pr(>|t|)
ProgramProgram A 63.0875912409 1.294086510 48.7506752932 1.355786133e-265
ProgramProgram B 65.3846153846 1.400330869 46.6922616771 1.207761156e-252
ProgramProgram C 66.0695652174 1.412455172 46.7763979425 3.534894216e-253
ProgramProgram D 66.6691729323 1.313402302 50.7606640010 5.038015651e-278
Treated:ProgramProgram A 2.8593114140 1.924837595 1.4854819032 1.377339219e-01
Treated:ProgramProgram B -0.9786003470 1.919883369 -0.5097186438 6.103619649e-01
Treated:ProgramProgram C -0.5066022544 1.922108032 -0.2635659631 7.921691261e-01
Treated:ProgramProgram D -2.8657541289 1.919883369 -1.4926709484 1.358412980e-01
$`Response Dollars`
Estimate Std. Error t value Pr(>|t|)
ProgramProgram A 998.5474452555 2.681598120 372.3702808887 0.0000000000
ProgramProgram B 997.4188034188 2.901757030 343.7292623810 0.0000000000
ProgramProgram C 1001.6869565217 2.926880936 342.2370019265 0.0000000000
ProgramProgram D 1001.2180451128 2.721624185 367.8752013053 0.0000000000
Treated:ProgramProgram A -0.9899231316 3.988636646 -0.2481858388 0.8040419882
Treated:ProgramProgram B 2.5060086113 3.978370529 0.6299082986 0.5288996396
Treated:ProgramProgram C -5.4721417069 3.982980462 -1.3738811324 0.1697889454
Treated:ProgramProgram D -4.0043698991 3.978370529 -1.0065351806 0.3144036460
$`Response DiseaseCnt`
Estimate Std. Error t value Pr(>|t|)
ProgramProgram A 4.53284671533 0.1793523653 25.27341475576 3.409326912e-109
ProgramProgram B 4.56410256410 0.1940771747 23.51694665775 1.515736580e-97
ProgramProgram C 4.25217391304 0.1957575279 21.72163675698 6.839384262e-86
ProgramProgram D 4.60150375940 0.1820294143 25.27890219412 3.133081901e-109
Treated:ProgramProgram A 0.13087009883 0.2667705543 0.49057175444 6.238378600e-01
Treated:ProgramProgram B -0.02274918064 0.2660839292 -0.08549625944 9.318841210e-01
Treated:ProgramProgram C 0.47375201288 0.2663922537 1.77840010867 7.564438017e-02
Treated:ProgramProgram D -0.31090546880 0.2660839292 -1.16844887901 2.429064705e-01
您特别关心回归 table 的 Treated:Program
个条目。
您应该先将其转换为 data.table
。 (在我的代码中我称你原来的 table DF
):
DT <- as.data.table(DF)
DT[, t.test(data=.SD, Age ~ Treated), by=Program]
Program statistic parameter p.value conf.int estimate null.value alternative
1: Program A -0.6286875 247.8390 0.5301326 -4.8110579 65.26667 0 two.sided
2: Program A -0.6286875 247.8390 0.5301326 2.4828527 66.43077 0 two.sided
3: Program B 1.4758524 230.5380 0.1413480 -0.9069634 67.15315 0 two.sided
4: Program B 1.4758524 230.5380 0.1413480 6.3211834 64.44604 0 two.sided
5: Program C 0.1994182 246.9302 0.8420998 -3.3560930 63.56557 0 two.sided
6: Program C 0.1994182 246.9302 0.8420998 4.1122406 63.18750 0 two.sided
7: Program D -1.1321569 246.0086 0.2586708 -6.1855837 62.31707 0 two.sided
8: Program D -1.1321569 246.0086 0.2586708 1.6701237 64.57480 0 two.sided
method data.name
1: Welch Two Sample t-test Age by Treated
2: Welch Two Sample t-test Age by Treated
3: Welch Two Sample t-test Age by Treated
4: Welch Two Sample t-test Age by Treated
5: Welch Two Sample t-test Age by Treated
6: Welch Two Sample t-test Age by Treated
7: Welch Two Sample t-test Age by Treated
8: Welch Two Sample t-test Age by Treated
在这种格式中,对于每个Program
,statistic
是相同的,等于t
,这里的parameter
是df
,对于 conf.int
,它(按顺序)先低后高(因此对于 Program A
,置信区间是 (-4.8110579, 2.4828527)
,对于 estimate
,它将是 group 0
然后是 group 1
(所以对于 Program A
,Treated == 0
的平均值是 65.26667,等等
这是我能想到的最快的解决方案,您可以遍历 vars
,或者也许有更简单的方法。
编辑:我只确认 Program A
和 Age
,使用以下代码:
DT[Program == 'Program A', t.test(Age ~ Treated)]
Welch Two Sample t-test
data: Age by Treated
t = -0.62869, df = 247.84, p-value = 0.5301
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.811058 2.482853
sample estimates:
mean in group 0 mean in group 1
65.26667 66.43077
编辑 2:这是循环遍历您的变量并将它们 rbind
组合在一起的代码:
do.call(rbind, lapply(vars, function(x) DT[, t.test(data=.SD, eval(parse(text=x)) ~ Treated), by=Program]))
您遇到错误是因为您试图从在创建的函数中访问tt$statistic
tt
。一些包围问题。
这是按照您的版本执行此操作的一种方法
results <- lapply(progs, function (i) {
DS = subset(DT, Program == i)
o <- lapply(vars, function (i) {
frm <- formula(paste0(i, '~ Treated'))
tt <- t.test(frm, DS)
data.frame(Variable=i, T=tt$statistic, P=tt$p.value)
})
o <- do.call(rbind, o)
o$Program <- i
o
})
do.call(rbind, results)
或者您可以使用(例如)ddply rbind
-ing 来完成(我认为 rbinding 仍然发生,只是在幕后):
library(plyr)
combinations <- expand.grid(Program=progs, Y=vars)
ddply(combinations, .(Program, Y),
function (x) {
# x is a dataframe with the program and variable;
# just do the t-test and add the statistic & p-val to it
frm <- formula(paste0(x$Y, '~ Treated'))
tt <- t.test(frm, subset(DT, Program == x$Program))
x$T <- tt$statistic
x$P <- tt$p.value
x
})
我想在两个人群(在治疗组内或治疗组外(下面的样本数据中分别为 1 或 0))的多个变量之间进行 t 检验,并且对于不同的研究,所有这些都坐着在同一个数据框中。在下面的示例数据中,我想为 1/0 治疗组之间的所有变量(在示例数据中:Age、Dollars、DiseaseCnt)生成 t 检验。我想 运行 这些 t 检验,按程序,而不是整个人群。我有生成 t 测试的逻辑。但是,我需要帮助完成最后一步,即从函数中提取适当的部分并创建一些容易消化的东西table。
最终,我想要的是:table 的 t-stats、p-值、执行 t-test 的变量以及测试变量的程序。
DT<-data.frame(
Treated=sample(0:1,1000,replace=T)
,Program=c('Program A','Program B','Program C','Program D')
,Age=as.integer(rnorm(1000,mean=65,sd=15))
,Dollars=as.integer(rpois(1000,lambda=1000))
,DiseaseCnt=as.integer(rnorm(1000,mean=5,sd=2)) )
progs<-unique(DT$Program) # Pull program names
vars<-names(DT)[3:5] # pull variables to run t tests
test<-lapply(progs, function(i)
tt<-lapply(vars, function(j) {t.test( DT[DT$Treated==1 & DT$Program == i,names(DT)==j]
,DT[DT$Treated==0 & DT$Program == i,names(DT)==j]
,alternative = 'two.sided' )
list(j,tt$statistic,tt$p.value) }
) )
# nested lapply produces results in list format that can be binded, but complete output w/ both lapply's is erroneous
您可以从回归中得到相同的 t 检验;如果您认为不同程序的治疗效果不同,则应包括交互作用。您还可以指定多个响应。
> m <- lm(cbind(Age,Dollars,DiseaseCnt)~Treated * Program - Treated - 1, DT)
> lapply(summary(m), `[[`, "coefficients")
$`Response Age`
Estimate Std. Error t value Pr(>|t|)
ProgramProgram A 63.0875912409 1.294086510 48.7506752932 1.355786133e-265
ProgramProgram B 65.3846153846 1.400330869 46.6922616771 1.207761156e-252
ProgramProgram C 66.0695652174 1.412455172 46.7763979425 3.534894216e-253
ProgramProgram D 66.6691729323 1.313402302 50.7606640010 5.038015651e-278
Treated:ProgramProgram A 2.8593114140 1.924837595 1.4854819032 1.377339219e-01
Treated:ProgramProgram B -0.9786003470 1.919883369 -0.5097186438 6.103619649e-01
Treated:ProgramProgram C -0.5066022544 1.922108032 -0.2635659631 7.921691261e-01
Treated:ProgramProgram D -2.8657541289 1.919883369 -1.4926709484 1.358412980e-01
$`Response Dollars`
Estimate Std. Error t value Pr(>|t|)
ProgramProgram A 998.5474452555 2.681598120 372.3702808887 0.0000000000
ProgramProgram B 997.4188034188 2.901757030 343.7292623810 0.0000000000
ProgramProgram C 1001.6869565217 2.926880936 342.2370019265 0.0000000000
ProgramProgram D 1001.2180451128 2.721624185 367.8752013053 0.0000000000
Treated:ProgramProgram A -0.9899231316 3.988636646 -0.2481858388 0.8040419882
Treated:ProgramProgram B 2.5060086113 3.978370529 0.6299082986 0.5288996396
Treated:ProgramProgram C -5.4721417069 3.982980462 -1.3738811324 0.1697889454
Treated:ProgramProgram D -4.0043698991 3.978370529 -1.0065351806 0.3144036460
$`Response DiseaseCnt`
Estimate Std. Error t value Pr(>|t|)
ProgramProgram A 4.53284671533 0.1793523653 25.27341475576 3.409326912e-109
ProgramProgram B 4.56410256410 0.1940771747 23.51694665775 1.515736580e-97
ProgramProgram C 4.25217391304 0.1957575279 21.72163675698 6.839384262e-86
ProgramProgram D 4.60150375940 0.1820294143 25.27890219412 3.133081901e-109
Treated:ProgramProgram A 0.13087009883 0.2667705543 0.49057175444 6.238378600e-01
Treated:ProgramProgram B -0.02274918064 0.2660839292 -0.08549625944 9.318841210e-01
Treated:ProgramProgram C 0.47375201288 0.2663922537 1.77840010867 7.564438017e-02
Treated:ProgramProgram D -0.31090546880 0.2660839292 -1.16844887901 2.429064705e-01
您特别关心回归 table 的 Treated:Program
个条目。
您应该先将其转换为 data.table
。 (在我的代码中我称你原来的 table DF
):
DT <- as.data.table(DF)
DT[, t.test(data=.SD, Age ~ Treated), by=Program]
Program statistic parameter p.value conf.int estimate null.value alternative
1: Program A -0.6286875 247.8390 0.5301326 -4.8110579 65.26667 0 two.sided
2: Program A -0.6286875 247.8390 0.5301326 2.4828527 66.43077 0 two.sided
3: Program B 1.4758524 230.5380 0.1413480 -0.9069634 67.15315 0 two.sided
4: Program B 1.4758524 230.5380 0.1413480 6.3211834 64.44604 0 two.sided
5: Program C 0.1994182 246.9302 0.8420998 -3.3560930 63.56557 0 two.sided
6: Program C 0.1994182 246.9302 0.8420998 4.1122406 63.18750 0 two.sided
7: Program D -1.1321569 246.0086 0.2586708 -6.1855837 62.31707 0 two.sided
8: Program D -1.1321569 246.0086 0.2586708 1.6701237 64.57480 0 two.sided
method data.name
1: Welch Two Sample t-test Age by Treated
2: Welch Two Sample t-test Age by Treated
3: Welch Two Sample t-test Age by Treated
4: Welch Two Sample t-test Age by Treated
5: Welch Two Sample t-test Age by Treated
6: Welch Two Sample t-test Age by Treated
7: Welch Two Sample t-test Age by Treated
8: Welch Two Sample t-test Age by Treated
在这种格式中,对于每个Program
,statistic
是相同的,等于t
,这里的parameter
是df
,对于 conf.int
,它(按顺序)先低后高(因此对于 Program A
,置信区间是 (-4.8110579, 2.4828527)
,对于 estimate
,它将是 group 0
然后是 group 1
(所以对于 Program A
,Treated == 0
的平均值是 65.26667,等等
这是我能想到的最快的解决方案,您可以遍历 vars
,或者也许有更简单的方法。
编辑:我只确认 Program A
和 Age
,使用以下代码:
DT[Program == 'Program A', t.test(Age ~ Treated)]
Welch Two Sample t-test
data: Age by Treated
t = -0.62869, df = 247.84, p-value = 0.5301
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.811058 2.482853
sample estimates:
mean in group 0 mean in group 1
65.26667 66.43077
编辑 2:这是循环遍历您的变量并将它们 rbind
组合在一起的代码:
do.call(rbind, lapply(vars, function(x) DT[, t.test(data=.SD, eval(parse(text=x)) ~ Treated), by=Program]))
您遇到错误是因为您试图从在创建的函数中访问tt$statistic
tt
。一些包围问题。
这是按照您的版本执行此操作的一种方法
results <- lapply(progs, function (i) {
DS = subset(DT, Program == i)
o <- lapply(vars, function (i) {
frm <- formula(paste0(i, '~ Treated'))
tt <- t.test(frm, DS)
data.frame(Variable=i, T=tt$statistic, P=tt$p.value)
})
o <- do.call(rbind, o)
o$Program <- i
o
})
do.call(rbind, results)
或者您可以使用(例如)ddply rbind
-ing 来完成(我认为 rbinding 仍然发生,只是在幕后):
library(plyr)
combinations <- expand.grid(Program=progs, Y=vars)
ddply(combinations, .(Program, Y),
function (x) {
# x is a dataframe with the program and variable;
# just do the t-test and add the statistic & p-val to it
frm <- formula(paste0(x$Y, '~ Treated'))
tt <- t.test(frm, subset(DT, Program == x$Program))
x$T <- tt$statistic
x$P <- tt$p.value
x
})