How to group and ungroup in R?
I have a data frame like the one shown below:
test_df <- data.frame("SN" = c("ABC123","ABC123","ABC123","MNO098","MNO098","MNO098"),
"code" = c("ABC1111","DEF222","GHI133","","MNO1123","MNO567"),
"d_time" = c("2220-08-27","2220-05-27","2220-02-27","2220-11-27","2220-02-27",""))
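I am trying to do two things:
1) Create 2 new columns (p_id, v_id) by removing the letters from the SN and code columns and keeping only the digits (at most 9 of them)
2) Create a lag column (p_vid) based on v_id for each person, with the rows sorted by his/her d_time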
t_df <- test_df %>% group_by(SN)
t_df %>% arrange((d_time), .by_group = TRUE) ->> sorted_df  # sorted based on d_time
transform_ids = function(DF){  # this function is to create person and visit_occurrence ids
  DF %>%
    mutate(p_id = as.integer(str_remove_all(.$SN, "[a-z]|[A-Z]") %>%  # retaining only the numeric part
                               str_sub(1, 9))) %>%
    mutate(v_id = as.integer(str_remove_all(.$code, "[a-z]|[A-Z]") %>%
                               str_sub(1, 9))) %>%
    group_by(p_id) %>%
    mutate(pre_vid = lag(v_id)) %>%
    ungroup
}
transform_ids(sorted_df)
But when I do this, I run into the following error:
Error in View : Column `p_id` must be length 3 (the group size) or one, not 6
Error: Column `p_id` must be length 3 (the group size) or one, not 6
In addition: Warning message:
In view(transform_ids(t_df)) :
Show Traceback
Rerun with Debug
Error: Column `p_id` must be length 3 (the group size) or one, not 6
I expect my output to look like the one shown below. Basically, I am trying to link each v_id of a person to his or her previous visit, i.e. p_vid.
To generate the p_id and v_id columns, just use gsub:
t_df$p_id <- gsub("[A-Z]+", "", t_df$SN)
t_df$v_id <- gsub("[A-Z]+", "", t_df$code)
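The question also asks to keep at most 9 digits, which the two gsub calls above do not enforce. A minimal base R sketch of that step follows. The helper name digits_only is hypothetical, and matching "[^0-9]" instead of "[A-Z]+" is an assumption made here so that lowercase letters and other characters are dropped as well.

```r
# Hypothetical helper: keep only digits, then at most the first max_len of them.
digits_only <- function(x, max_len = 9) {
  x <- as.character(x)          # make sure factors are treated as text
  x <- gsub("[^0-9]", "", x)    # drop every character that is not a digit
  substr(x, 1, max_len)         # keep up to max_len characters
}

sn   <- c("ABC123", "MNO098")
code <- c("ABC1111", "", "MNO1234567890")

digits_only(sn)    # "123" "098"
digits_only(code)  # "1111" "" "123456789"
```

The result stays a character vector, so a leading zero such as in "098" is preserved; converting with as.integer() would turn it into 98.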
For the p_vid column, use lag() from the dplyr package:
t_df <- t_df %>%
  group_by(p_id) %>%
  mutate(p_vid = lag(v_id, order_by = d_time, default = "0"))  # "0" matches the character type of v_id
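To see what order_by does inside lag(), here is a small sketch on one person's visits, outside of any data frame. The vectors visits and dates are illustrative copies of the ABC123 rows from the question; converting the dates with as.Date() is an assumption made here, since the original column is a factor.

```r
library(dplyr)

# Visits of one person, in the (unsorted) order they appear in the data
visits <- c("1111", "222", "133")
dates  <- as.Date(c("2220-08-27", "2220-05-27", "2220-02-27"))

# For each visit, take the visit that precedes it when sorted by date.
# The earliest visit has no predecessor, so it gets the default value.
prev <- dplyr::lag(visits, order_by = dates, default = "0")
prev  # "222" "133" "0"
```

The values come back in the original row order, so there is no need to sort the data frame first when order_by is supplied.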
The group_by/mutate call above returns a grouped tibble. If you want a plain data frame, just use:
t_df <- as.data.frame(t_df)
Output:
  SN     code    d_time     p_id  v_id  p_vid
  <fct>  <fct>   <fct>      <chr> <chr> <chr>
1 ABC123 ABC1111 2220-08-27 123   1111  222
2 ABC123 DEF222  2220-05-27 123   222   133
3 ABC123 GHI133  2220-02-27 123   133   0
4 MNO098 ""      2220-11-27 098   ""    1123
5 MNO098 MNO1123 2220-02-27 098   1123  567
6 MNO098 MNO567  ""         098   567   0
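As for the error in the question: inside mutate() on a grouped data frame each expression is evaluated per group (3 rows here), while .$SN refers to the whole data frame piped in (6 rows), which is why dplyr complains about the length. Referring to the bare column names avoids this. The following is one possible rewrite of transform_ids under that assumption, not the only way to do it; it keeps the integer ids and NA defaults of the question's version rather than the character ids shown above.

```r
library(dplyr)
library(stringr)

test_df <- data.frame(
  SN     = c("ABC123", "ABC123", "ABC123", "MNO098", "MNO098", "MNO098"),
  code   = c("ABC1111", "DEF222", "GHI133", "", "MNO1123", "MNO567"),
  d_time = c("2220-08-27", "2220-05-27", "2220-02-27", "2220-11-27", "2220-02-27", ""),
  stringsAsFactors = FALSE
)

transform_ids <- function(DF) {
  DF %>%
    ungroup() %>%                               # drop any grouping the caller applied
    mutate(
      # bare column names (SN, code) instead of .$SN / .$code
      p_id = as.integer(str_sub(str_remove_all(SN, "[A-Za-z]"), 1, 9)),
      v_id = as.integer(str_sub(str_remove_all(code, "[A-Za-z]"), 1, 9))
    ) %>%
    group_by(p_id) %>%
    arrange(d_time, .by_group = TRUE) %>%       # sort visits within each person
    mutate(pre_vid = dplyr::lag(v_id)) %>%      # previous visit of the same person
    ungroup()                                   # return an ungrouped tibble
}

res <- transform_ids(test_df)
res
```

Note that as.integer() turns the empty code into NA and drops the leading zero of "098", and that the empty d_time sorts before every date because the column is compared as text.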