Determining the percentage of values in each column for each cluster

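I need to determine, for each cluster, the percentage of values in each column that satisfy a condition. A reproducible example follows. I have a table like this: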

> tab
            GI     RT     TR    VR Cluster_number
1   1000086986 0.5814 0.5814 0.628              1
10  1000728257 0.5814 0.5814 0.628              1
13  1000074769 0.7879 0.7879 0.443              2
14  1000498642 0.7879 0.7879 0.443              2
22  1000074765 0.7941 0.3600 0.533              3
26  1000597385 0.7941 0.3600 0.533              3
31  1000502373 0.5000 0.5000 0.607              4
32  1000532631 0.6875 0.7059 0.607              4
33  1000597694 0.5000 0.5000 0.607              4
34  1000598724 0.5000 0.5000 0.607              4

I need a table like this:

> tab1
   Cluster_number RT_cond TR_cond VR_cond
1               1 0        0        100
2               2 100      100      0  
3               3 100      0        0
4               4 25       25       100  

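Each column gives the percentage of GIs in that cluster for which RT >= 0.6, TR >= 0.6, and VR >= 0.6, respectively. For example, in the first cluster all RT values are below 0.6, so the first row contains 0. In the fourth cluster, one of the four TR values is >= 0.6, so the corresponding value is 25. How can I do this?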

You can group_by Cluster_number and use across to calculate the percentages:

library(dplyr)
df %>%
  group_by(Cluster_number) %>%
  summarise(across(RT:VR, ~mean(. >= 0.6) * 100, .names = '{col}_cond'))
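  # In older versions of dplyr (before 1.0.0, which introduced across), use summarise_at;
  # the named list adds the "_cond" suffix to each column:
  # summarise_at(vars(RT:VR), list(cond = ~mean(. >= 0.6) * 100))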


#  Cluster_number RT_cond TR_cond VR_cond
#           <int>   <dbl>   <dbl>   <dbl>
#1              1       0       0     100
#2              2     100     100       0
#3              3     100       0       0
#4              4      25      25     100
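
To see why `mean(. >= 0.6) * 100` yields a percentage, here is a minimal sketch on the RT values of cluster 4 from the example data. The variable names `rt_cluster4`, `above`, and `pct` are illustrative only and do not appear in the answer.

```r
# RT values of cluster 4, copied from the example data
rt_cluster4 <- c(0.5, 0.6875, 0.5, 0.5)

# The comparison returns a logical vector: FALSE TRUE FALSE FALSE
above <- rt_cluster4 >= 0.6

# mean() treats TRUE as 1 and FALSE as 0, so it gives the share of TRUE values
pct <- mean(above) * 100
pct
# [1] 25
```

The same expression is applied to every column within every group, which is all that across does here.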

In base R, we can use aggregate:

aggregate(cbind(RT, TR, VR)~Cluster_number, df, function(x) mean(x >= 0.6) * 100)
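
Note that aggregate keeps the original column names (RT, TR, VR). If the `_cond` suffix from the requested output is wanted, one possible sketch is to rename the columns afterwards. The data frame is rebuilt here without the GI column so the snippet runs on its own, and `res` is an illustrative name.

```r
# Example data from the question (GI column omitted, it is not needed here)
df <- data.frame(
  RT = c(0.5814, 0.5814, 0.7879, 0.7879, 0.7941, 0.7941, 0.5, 0.6875, 0.5, 0.5),
  TR = c(0.5814, 0.5814, 0.7879, 0.7879, 0.36, 0.36, 0.5, 0.7059, 0.5, 0.5),
  VR = c(0.628, 0.628, 0.443, 0.443, 0.533, 0.533, 0.607, 0.607, 0.607, 0.607),
  Cluster_number = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L)
)

# Percentage of values >= 0.6 per cluster and column
res <- aggregate(cbind(RT, TR, VR) ~ Cluster_number, df,
                 function(x) mean(x >= 0.6) * 100)

# Append "_cond" to every column except the grouping column
names(res)[-1] <- paste0(names(res)[-1], "_cond")
res
```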

Data

df <- structure(list(GI = c(1000086986L, 1000728257L, 1000074769L, 
1000498642L, 1000074765L, 1000597385L, 1000502373L, 1000532631L, 
1000597694L, 1000598724L), RT = c(0.5814, 0.5814, 0.7879, 0.7879, 
0.7941, 0.7941, 0.5, 0.6875, 0.5, 0.5), TR = c(0.5814, 0.5814, 
0.7879, 0.7879, 0.36, 0.36, 0.5, 0.7059, 0.5, 0.5), VR = c(0.628, 
0.628, 0.443, 0.443, 0.533, 0.533, 0.607, 0.607, 0.607, 0.607
), Cluster_number = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L)), 
class = "data.frame", row.names = c("1", "10", "13", "14", "22", 
 "26", "31", "32", "33", "34"))

With the dplyr package, you can use group_by followed by summarise, then rename the columns of interest with the rename_with function (new in dplyr 1.0.0):

library(dplyr)

tab %>% 
  group_by(Cluster_number) %>% 
  summarise(across(c(RT, TR, VR), ~mean(. >= 0.6)*100)) %>% 
  rename_with(~paste0(., "_cond"), c(RT, TR, VR))

# A tibble: 4 x 4
#   Cluster_number RT_cond TR_cond VR_cond
#            <int>   <dbl>   <dbl>   <dbl>
# 1              1       0       0     100
# 2              2     100     100       0
# 3              3     100       0       0
# 4              4      25      25     100