Kruskal-Wallis p-value matrix for data subsets with R
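Consider a data set Data that has several factor variables and several continuous numeric variables. Some of these variables, say slice_by_1 (with classes "Male", "Female") and slice_by_2 (with classes "Sad", "Neutral", "Happy"), are used to 'slice' the data into subsets. For each subset, a Kruskal-Wallis test should be run on the variables length, preasure and pulse, each grouped by another factor variable called compare_by. Is there a quick way in R to accomplish this and put the calculated p-values into a matrix?
I used the dplyr package to prepare the data.
Sample data set: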
library(dplyr)
set.seed(123)
# Note: tbl_df() is deprecated in current dplyr; as_tibble() is its replacement.
Data <- tbl_df(
  data.frame(
    slice_by_1 = as.factor(rep(c("Male", "Female"), times = 120)),
    slice_by_2 = as.factor(rep(c("Happy", "Neutral", "Sad"), each = 80)),
    compare_by = as.factor(rep(c("blue", "green", "brown"), times = 80)),
    length     = c(sample(1:10, 120, replace = TRUE), sample(5:12, 120, replace = TRUE)),
    pulse      = runif(240, 60, 120),
    preasure   = c(rnorm(80, 1, 2), rnorm(80, 1, 2.1), rnorm(80, 1, 3))
  )
) %>%
  group_by(slice_by_1, slice_by_2)
Let's look at the data:
Source: local data frame [240 x 6]
Groups: slice_by_1, slice_by_2

   slice_by_1 slice_by_2 compare_by length     pulse     preasure
1        Male      Happy       blue     10  69.23376  0.508694601
2      Female      Happy      green      1  68.57866 -1.155632020
3        Male      Happy      brown      8 112.72132  0.007031799
4      Female      Happy       blue      3 116.61283  0.383769524
5        Male      Happy      green      7 110.06851 -0.717791526
6      Female      Happy      brown      8 117.62481  2.938658488
7        Male      Happy       blue      9 105.59749  0.735831389
8      Female      Happy      green      2  83.44101  3.881268679
9        Male      Happy      brown      5 101.48334  0.025572561
10     Female      Happy       blue     10  62.87331 -0.715108893
..        ...        ...        ...    ...       ...          ...
Example of the desired output:
    Data_subsets    length  preasure     pulse
1     Male_Happy <p-value> <p-value> <p-value>
2   Female_Happy <p-value> <p-value> <p-value>
3   Male_Neutral <p-value> <p-value> <p-value>
4 Female_Neutral <p-value> <p-value> <p-value>
5       Male_Sad <p-value> <p-value> <p-value>
6     Female_Sad <p-value> <p-value> <p-value>
You've got most of it with the group_by, now you just need to do it:
Data %>%
  do({
    data.frame(
      Data_subsets = paste(.$slice_by_1[[1]], .$slice_by_2[[1]], sep = '_'),
      length   = kruskal.test(.$length, .$compare_by)$p.value,
      preasure = kruskal.test(.$preasure, .$compare_by)$p.value,
      pulse    = kruskal.test(.$pulse, .$compare_by)$p.value,
      stringsAsFactors = FALSE)
  }) %>%
  ungroup() %>%
  select(-starts_with("slice_"))

## Source: local data frame [6 x 4]
##
##     Data_subsets    length  preasure     pulse
## 1   Female_Happy 0.4369918 0.1937327 0.8767561
## 2 Female_Neutral 0.3750688 0.8588069 0.2858796
## 3     Female_Sad 0.7958502 0.6274940 0.5801208
## 4     Male_Happy 0.3099704 0.6929493 0.3796494
## 5   Male_Neutral 0.4953853 0.2986860 0.2418708
## 6       Male_Sad 0.7159970 0.8528201 0.5686672
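You have to ungroup() before you can drop the slice* columns, because the group_by columns are not dropped while the data is still grouped (I want to say "never dropped", but I'm not sure).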
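For comparison, here is a minimal base R sketch of the same computation that returns an actual matrix, since that is what the question asks for. It is not taken from the answer above: it rebuilds the sample data as a plain data.frame, and the names vars, subsets, p_by_subset and p_matrix are made up for illustration. No output is shown because the p-values depend on the random draws.

```r
# Rebuild the sample data with base R only (same columns as in the question).
set.seed(123)
Data <- data.frame(
  slice_by_1 = factor(rep(c("Male", "Female"), times = 120)),
  slice_by_2 = factor(rep(c("Happy", "Neutral", "Sad"), each = 80)),
  compare_by = factor(rep(c("blue", "green", "brown"), times = 80)),
  length     = c(sample(1:10, 120, replace = TRUE), sample(5:12, 120, replace = TRUE)),
  pulse      = runif(240, 60, 120),
  preasure   = c(rnorm(80, 1, 2), rnorm(80, 1, 2.1), rnorm(80, 1, 3))
)

vars <- c("length", "preasure", "pulse")  # columns to test (illustrative name)

# One data frame per slice_by_1 x slice_by_2 combination, named e.g. "Male_Happy".
subsets <- split(Data, list(Data$slice_by_1, Data$slice_by_2), sep = "_", drop = TRUE)

# For each subset, compute one Kruskal-Wallis p-value per tested column.
p_by_subset <- sapply(subsets, function(d) {
  sapply(vars, function(v) kruskal.test(d[[v]], d$compare_by)$p.value)
})

# sapply() puts the subsets in columns; transpose so that each subset is a row.
p_matrix <- t(p_by_subset)
```

The result is a 6 x 3 numeric matrix whose row names are the subset labels and whose column names are the tested variables.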
We could use Map within do to run kruskal.test on multiple columns, and then use unite from library(tidyr) to join the 'slice_by_1' and 'slice_by_2' columns into a single column, 'Data_subsets'.
library(dplyr)
library(tidyr)
nm1 <- names(Data)[4:6]
f1 <- function(x, y) kruskal.test(x ~ y)$p.value

Data %>%
  do({data.frame(Map(f1, .[nm1], list(.$compare_by)))}) %>%
  unite(Data_subsets, slice_by_1, slice_by_2, sep = "_")

#    Data_subsets    length     pulse  preasure
#1   Female_Happy 0.4369918 0.8767561 0.1937327
#2 Female_Neutral 0.3750688 0.2858796 0.8588069
#3     Female_Sad 0.7958502 0.5801208 0.6274940
#4     Male_Happy 0.3099704 0.3796494 0.6929493
#5   Male_Neutral 0.4953853 0.2418708 0.2986860
#6       Male_Sad 0.7159970 0.5686672 0.8528201
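To see why passing list(.$compare_by) to Map works, here is a small standalone sketch. The data frame toy and its columns a, b and g are hypothetical; f1 is the helper defined above. Map iterates over the columns of its first argument, and the length-one list holding the grouping factor is recycled for each of them. The sketch also checks that the formula interface used in f1 gives the same p-values as the kruskal.test(x, g) form.

```r
set.seed(1)
# Hypothetical toy data: two numeric columns and one grouping factor.
toy <- data.frame(
  a = rnorm(30),
  b = runif(30),
  g = factor(rep(c("blue", "green", "brown"), times = 10))
)

f1 <- function(x, y) kruskal.test(x ~ y)$p.value

# Map() walks over the columns a and b; list(toy$g) has length 1,
# so the same grouping factor is recycled for every column.
p_map <- Map(f1, toy[c("a", "b")], list(toy$g))

# The same p-values computed one at a time with the default (x, g) interface.
p_direct <- list(
  a = kruskal.test(toy$a, toy$g)$p.value,
  b = kruskal.test(toy$b, toy$g)$p.value
)

# TRUE if both routes agree.
all.equal(p_map, p_direct)
```

Wrapping the factor in list() matters here: without it, Map would try to pair each column with a single element of the factor rather than with the whole vector.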
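Or we can do this using data.table. We convert the 'data.frame' to a 'data.table' (setDT(Data)), create the grouping variable ('Data_subsets') by pasting the 'slice_by_1' and 'slice_by_2' columns together, then subset the columns of the data set and pass them as input to Map, run kruskal.test and extract the p.value.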
library(data.table)
setDT(Data)[, Map(f1, .SD[, nm1, with = FALSE], list(compare_by)),
            by = .(Data_subsets = paste(slice_by_1, slice_by_2, sep = '_'))]

#     Data_subsets    length     pulse  preasure
#1:     Male_Happy 0.3099704 0.3796494 0.6929493
#2:   Female_Happy 0.4369918 0.8767561 0.1937327
#3:   Male_Neutral 0.4953853 0.2418708 0.2986860
#4: Female_Neutral 0.3750688 0.2858796 0.8588069
#5:       Male_Sad 0.7159970 0.5686672 0.8528201
#6:     Female_Sad 0.7958502 0.5801208 0.6274940
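The approaches above return a data frame or data.table rather than a matrix. If a true numeric matrix is needed, one possible conversion is sketched below in base R. The object res is a hypothetical stand-in for the result, with two rows copied from the printed output; p_matrix is an illustrative name.

```r
# Hypothetical stand-in for the result of the approaches above;
# the two rows are copied from the printed output.
res <- data.frame(
  Data_subsets = c("Male_Happy", "Female_Happy"),
  length   = c(0.3099704, 0.4369918),
  pulse    = c(0.3796494, 0.8767561),
  preasure = c(0.6929493, 0.1937327),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns and move the subset labels into row names.
p_matrix <- as.matrix(res[, c("length", "preasure", "pulse")])
rownames(p_matrix) <- res$Data_subsets

# Individual p-values can then be looked up by subset and variable.
p_matrix["Male_Happy", "pulse"]
```

Keeping the labels as row names rather than as a column is what lets the matrix stay numeric.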