How can I do a t-test between subsets of data in a data frame in R?
I have a data frame, df1, that looks like this:
Stabr Area_name Score1 Score2 POVALL_2018 Score3
3 AL Autauga County 2 2 7,587 13.8
4 AL Baldwin County 2 2 21,069 9.8
7 AL Blount County 2 1 7,527 13.2
8 AL Bullock County 3 6 3,610 42.5
9 AL Butler County 3 6 4,731 24.5
10 AL Calhoun County 3 2 21,719 19.5
11 AL Chambers County 6 5 6,181 18.7
12 AL Cherokee County 2 6 4,180 16.3
13 AL Chilton County 2 1 7,542 17.3
14 AL Choctaw County 3 10 2,806 22.1
16 AL Clay County 9 10 2,285 17.6
17 AL Cleburne County 8 4 2,356 16.0
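I only care about the columns Score1 and Score3. I would like to run a simple t-test to see whether all counties with a Score1 of 2 have a different Score3 than all counties with a Score1 of 3.
To be very specific, I want to see whether the mean of 13.8, 9.8, 13.2, 16.3, 17.3 differs significantly from the mean of 42.5, 24.5, 19.5, 22.1. I want to ignore all rows where Score1 is anything other than 2 or 3.
How can this be done?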
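You can subset your data frame and then run t.test on the result. First, the subset (here df is the full data frame from the reproducible example at the end, and df1 is the subset):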
df1 <- subset(df, Score1 %in% 2:3)
Stabr Area_name Score1 Score2 POVALL_2018 Score3
1: AL AutaugaCounty 2 2 7,587 13.8
2: AL BaldwinCounty 2 2 21,069 9.8
3: AL BlountCounty 2 1 7,527 13.2
4: AL BullockCounty 3 6 3,610 42.5
5: AL ButlerCounty 3 6 4,731 24.5
6: AL CalhounCounty 3 2 21,719 19.5
7: AL CherokeeCounty 2 6 4,180 16.3
8: AL ChiltonCounty 2 1 7,542 17.3
9: AL ChoctawCounty 3 10 2,806 22.1
Then run t.test:
t.test(Score3~Score1,data = df1)
Welch Two Sample t-test
data: Score3 by Score1
t = -2.4293, df = 3.3817, p-value = 0.08372
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-29.148945 3.008945
sample estimates:
mean in group 2 mean in group 3
14.08 27.15
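The same comparison can also be run on the two vectors directly, which makes it explicit which values enter each group. This is a minimal sketch, assuming the full data frame is called df as above; it is rebuilt here with only the two relevant columns so the block is self-contained, and the names g2, g3 and res are illustrative:

```r
# Only the two columns needed for the test, copied from the reproducible example
df <- data.frame(
  Score1 = c(2L, 2L, 2L, 3L, 3L, 3L, 6L, 2L, 2L, 3L, 9L, 8L),
  Score3 = c(13.8, 9.8, 13.2, 42.5, 24.5, 19.5, 18.7, 16.3, 17.3, 22.1, 17.6, 16)
)

# Score3 values for each group of interest (g2 and g3 are illustrative names)
g2 <- df$Score3[df$Score1 == 2]
g3 <- df$Score3[df$Score1 == 3]

# Welch two-sample t-test, the default when var.equal is left at FALSE
res <- t.test(g2, g3)
res$estimate  # the two group means
res$p.value   # the two-sided p-value
```

Rows with any other Score1 value never reach the test, so this should give the same result as the formula call on the subset.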
Because there are only a few samples in each group, I would (personally) prefer a non-parametric test such as Mann-Whitney (via the function wilcox.test):
wilcox.test(Score3~Score1,data = df1)
Wilcoxon rank sum test
data: Score3 by Score1
W = 0, p-value = 0.01587
alternative hypothesis: true location shift is not equal to 0
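W = 0 means that every Score3 value in group 2 is below every value in group 3, which is the most extreme ordering possible. With small samples and no ties, wilcox.test computes an exact p-value, so the figure above can be checked by hand. A short sketch of that arithmetic (the variable names are illustrative):

```r
n2 <- 5  # counties with Score1 == 2
n3 <- 4  # counties with Score1 == 3

# Number of ways to choose which 4 of the 9 ranks belong to group 3
n_arrangements <- choose(n2 + n3, n3)

# Exactly one arrangement gives W = 0 and one gives the opposite extreme,
# so the two-sided exact p-value is 2 divided by the number of arrangements
p_exact <- 2 / n_arrangements
p_exact
```

This agrees with the reported p-value of 0.01587 after rounding. It also shows that with groups of 5 and 4, no two-sided p-value from this test can be smaller than that.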
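Edit: t.test based on values of Score1 (following the OP's comment)
If you want to test all values < 3 against all values >= 3, you can add a grouping variable with an ifelse statement, for example: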
df$Group <- ifelse(df$Score1 <3,"A","B")
t.test(Score3~Group,data = df)
Welch Two Sample t-test
data: Score3 by Group
t = -2.429, df = 7.6464, p-value = 0.04262
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-17.429041 -0.382388
sample estimates:
mean in group A mean in group B
14.08000 22.98571
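Before reading the test result, it may help to check how many rows fall into each group and what the group means are. A minimal sketch, again rebuilding only the two relevant columns from the reproducible example (the names sizes and means are illustrative):

```r
# Only the two columns needed here, copied from the reproducible example
df <- data.frame(
  Score1 = c(2L, 2L, 2L, 3L, 3L, 3L, 6L, 2L, 2L, 3L, 9L, 8L),
  Score3 = c(13.8, 9.8, 13.2, 42.5, 24.5, 19.5, 18.7, 16.3, 17.3, 22.1, 17.6, 16)
)

# Same grouping rule as above: "A" for Score1 < 3, "B" otherwise
df$Group <- ifelse(df$Score1 < 3, "A", "B")

# Number of counties in each group
sizes <- table(df$Group)
sizes

# Mean Score3 per group, to compare against the t.test estimates
means <- aggregate(Score3 ~ Group, data = df, FUN = mean)
means
```

Group B now also contains the counties with a Score1 of 6, 8 and 9, which is why its mean differs from the group 3 mean in the first test.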
Does this answer your question?
Reproducible example:
df <- structure(list(Stabr = c("AL", "AL", "AL", "AL", "AL", "AL",
"AL", "AL", "AL", "AL", "AL", "AL"), Area_name = c("AutaugaCounty",
"BaldwinCounty", "BlountCounty", "BullockCounty", "ButlerCounty",
"CalhounCounty", "ChambersCounty", "CherokeeCounty", "ChiltonCounty",
"ChoctawCounty", "ClayCounty", "CleburneCounty"), Score1 = c(2L,
2L, 2L, 3L, 3L, 3L, 6L, 2L, 2L, 3L, 9L, 8L), Score2 = c(2L, 2L,
1L, 6L, 6L, 2L, 5L, 6L, 1L, 10L, 10L, 4L), POVALL_2018 = c("7,587",
"21,069", "7,527", "3,610", "4,731", "21,719", "6,181", "4,180",
"7,542", "2,806", "2,285", "2,356"), Score3 = c(13.8, 9.8, 13.2,
42.5, 24.5, 19.5, 18.7, 16.3, 17.3, 22.1, 17.6, 16)), row.names = c(NA,
-12L), class = c("data.table", "data.frame"))