Count unique occurrences of factor levels and numeric values with dplyr on long-format data
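I have repeated-measures data for 8 patients, each with a different number of repeated measurements on the same variables. The variables measured are sex, blood pressure (sys_bp), and the number of CT scans a person has had: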
library(dplyr)
library(magrittr)
questiondata <- structure(list(id = c(2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4,
4, 7, 7, 8, 8, 8, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 20,
20, 20), time = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L,
1L, 2L, 3L, 4L, 5L, 1L, 6L, 1L, 2L, 5L, 1L, 2L, 3L, 4L, 5L, 1L,
2L, 3L, 4L, 5L, 1L, 2L, 4L), .Label = c("T0", "T1M0", "T1M6",
"T1M12", "T2M0", "FU1"), class = "factor"), sys_bp = c(116, 125.8,
NA, NA, NA, 113.2, NA, NA, NA, NA, 146, NA, NA, NA, NA, NA, NA,
125, NA, NA, 164.5, NA, NA, NA, NA, 150.5, NA, NA, NA, NA, 158,
NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L), .Label = c("female", "male"), class = "factor"),
ct_amount = c(4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 2L, 2L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 3L, 3L, 3L)), row.names = c(NA, -32L), class = c("tbl_df",
"tbl", "data.frame"))
questiondata
id time sys_bp sex ct_amount
<dbl> <fct> <dbl> <fct> <int>
1 2 T0 116 female 4
2 2 T1M0 126. female 4
3 2 T1M6 NA female 4
4 2 T1M12 NA female 4
5 3 T0 NA female 5
6 3 T1M0 113. female 5
7 3 T1M6 NA female 5
8 3 T1M12 NA female 5
9 3 T2M0 NA female 5
10 4 T0 NA male 5
11 4 T1M0 146 male 5
12 4 T1M6 NA male 5
13 4 T1M12 NA male 5
14 4 T2M0 NA male 5
15 7 T0 NA female 2
16 7 FU1 NA female 2
17 8 T0 NA female 3
18 8 T1M0 125 female 3
19 8 T2M0 NA female 3
20 13 T0 NA female 5
21 13 T1M0 164. female 5
22 13 T1M6 NA female 5
23 13 T1M12 NA female 5
24 13 T2M0 NA female 5
25 14 T0 NA male 5
26 14 T1M0 150. male 5
27 14 T1M6 NA male 5
28 14 T1M12 NA male 5
29 14 T2M0 NA male 5
30 20 T0 NA female 3
31 20 T1M0 158 female 3
32 20 T1M12 NA female 3
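I want to count the number of people who (1) are male/female and (2) have had 1/2/3/4/5 CT scans.
So the output would be (1) 6 females and 2 males, and (2) 1 person with 2 CTs, 2 people with 3 CTs, 1 person with 4 CTs, and 4 people with 5 CTs.
I have tried many combinations of group_by, summarise, and count, but can't seem to get it right. Any help?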
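You can first keep only one row per id (distinct with .keep_all = TRUE keeps the first row for each id along with all other columns), then use count to get the output.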
library(dplyr)
unique_data <- questiondata %>% distinct(id, .keep_all = TRUE)
unique_data %>% count(sex)
# A tibble: 2 x 2
# sex n
# <fct> <int>
#1 female 6
#2 male 2
unique_data %>% count(ct_amount)
# A tibble: 4 x 2
# ct_amount n
# <int> <int>
#1 2 1
#2 3 2
#3 4 1
#4 5 4
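Since the question mentions group_by and summarise, the same counts can also be sketched by counting distinct ids within each group, without deduplicating first. This is an alternative, not the approach shown above. Here questiondata_small is a hypothetical compact rebuild of the id, sex, and ct_amount columns from the question so the block runs on its own, and .groups = "drop" assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical compact rebuild of the question's id / sex / ct_amount columns:
# one value per patient, repeated once per measurement row (32 rows in total).
reps <- c(4, 5, 5, 2, 3, 5, 5, 3)
questiondata_small <- tibble(
  id = rep(c(2, 3, 4, 7, 8, 13, 14, 20), times = reps),
  sex = factor(rep(c("female", "female", "male", "female",
                     "female", "female", "male", "female"), times = reps)),
  ct_amount = rep(c(4L, 5L, 5L, 2L, 3L, 5L, 5L, 3L), times = reps)
)

# Count patients (not rows) per sex: n_distinct(id) ignores repeated measurements.
sex_counts <- questiondata_small %>%
  group_by(sex) %>%
  summarise(n = n_distinct(id), .groups = "drop")

# Same idea for the number of CT scans.
ct_counts <- questiondata_small %>%
  group_by(ct_amount) %>%
  summarise(n = n_distinct(id), .groups = "drop")

sex_counts  # female 6, male 2
ct_counts   # 2 -> 1, 3 -> 2, 4 -> 1, 5 -> 4
```

Unlike keeping the first row per id, this version would count a patient under every sex or ct_amount value recorded for them, so the two approaches only agree when those columns are constant within each id.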
We can also use duplicated with filter to keep only the first row for each id:
library(dplyr)
questiondata %>%
filter(!duplicated(id)) %>%
count(ct_amount)
# A tibble: 4 x 2
ct_amount n
<int> <int>
1 2 1
2 3 2
3 4 1
4 5 4
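Both approaches keep the first row per id, so they rely on sex and ct_amount never changing within a patient. A minimal sketch of a check for that assumption follows. It again uses questiondata_small, a hypothetical compact rebuild of the relevant columns from the question, and the names n_sex, n_ct, and inconsistent are illustrative only.

```r
library(dplyr)

# Hypothetical compact rebuild of the question's id / sex / ct_amount columns.
reps <- c(4, 5, 5, 2, 3, 5, 5, 3)
questiondata_small <- tibble(
  id = rep(c(2, 3, 4, 7, 8, 13, 14, 20), times = reps),
  sex = factor(rep(c("female", "female", "male", "female",
                     "female", "female", "male", "female"), times = reps)),
  ct_amount = rep(c(4L, 5L, 5L, 2L, 3L, 5L, 5L, 3L), times = reps)
)

# For each id, count how many different values each column takes,
# then keep only the ids where a column takes more than one value.
inconsistent <- questiondata_small %>%
  group_by(id) %>%
  summarise(
    n_sex = n_distinct(sex),
    n_ct = n_distinct(ct_amount),
    .groups = "drop"
  ) %>%
  filter(n_sex > 1 | n_ct > 1)

nrow(inconsistent)  # 0 for this data: every id has a single sex and ct_amount
```

If this returns any rows, taking the first row per id would silently pick one of several conflicting values, and the data would need cleaning before counting.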