R: How to accumulate values in a list column, based on multiple criteria
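I have a dataset of patients treated in different hospitals (inpatients only), and some analysis revealed a few inconsistencies in it. One of them is that the software allowed patients to be admitted without closing their previously open case_id.
To understand it better, let us consider the sample dataset.
Sample data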
dput(df)
df <- structure(list(case_id = 1:22, patient_id = c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 7L,
8L, 8L), pack_id = c(12L, 62L, 59L, 68L, 77L, 86L, 20L, 55L,
86L, 72L, 7L, 54L, 75L, 26L, 21L, 12L, 49L, 35L, 51L, 31L, 10L,
54L), hosp_id = c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 3L, 3L, 4L, 2L,
3L, 3L, 3L, 4L, 5L, 6L, 6L, 7L, 7L, 8L, 8L), admn_date = structure(c(18262,
18264, 18265, 18266, 18277, 18279, 18283, 18262, 18264, 18277,
18287, 18275, 18301, 18291, 18366, 18374, 18309, 18319, 18364,
18303, 18328, 18341), class = "Date"), discharge_date = structure(c(18275,
18276, 18271, 18275, 18288, 18280, 18286, 18275, 18276, 18288,
18291, 18283, 18309, 18297, 18375, 18381, 18347, 18328, 18367,
18309, 18341, 18344), class = "Date")), row.names = c(NA, -22L
), class = "data.frame")
> df
case_id patient_id pack_id hosp_id admn_date discharge_date
1 1 1 12 1 2020-01-01 2020-01-14
2 2 1 62 1 2020-01-03 2020-01-15
3 3 1 59 2 2020-01-04 2020-01-10
4 4 1 68 2 2020-01-05 2020-01-14
5 5 1 77 1 2020-01-16 2020-01-27
6 6 1 86 1 2020-01-18 2020-01-19
7 7 1 20 2 2020-01-22 2020-01-25
8 8 2 55 3 2020-01-01 2020-01-14
9 9 2 86 3 2020-01-03 2020-01-15
10 10 2 72 4 2020-01-16 2020-01-27
11 11 1 7 2 2020-01-26 2020-01-30
12 12 3 54 3 2020-01-14 2020-01-22
13 13 3 75 3 2020-02-09 2020-02-17
14 14 3 26 3 2020-01-30 2020-02-05
15 15 4 21 4 2020-04-14 2020-04-23
16 16 4 12 5 2020-04-22 2020-04-29
17 17 5 49 6 2020-02-17 2020-03-26
18 18 5 35 6 2020-02-27 2020-03-07
19 19 6 51 7 2020-04-12 2020-04-15
20 20 7 31 7 2020-02-11 2020-02-17
21 21 8 10 8 2020-03-07 2020-03-20
22 22 8 54 8 2020-03-20 2020-03-23
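If we look at the data above, the patient with id 1 was admitted to hospital_1 on 1 January (row 1) and discharged on 14 January. Before that discharge, the patient was admitted again to the same hospital (row 2) and twice more to hospital_2 (rows 3 and 4), and all four records were finally closed by 15 January (row 2).
I have already filtered the records where a patient was admitted to several hospitals, or to the same hospital several times, with the following code.
Code tried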
library(dplyr)  # arrange(), mutate(), group_by()
library(tidyr)  # pivot_longer()
df_2 <- df %>% arrange(patient_id, admn_date, discharge_date) %>%
  mutate(sort_key = row_number()) %>%
  pivot_longer(c(admn_date, discharge_date), names_to ="activity",
               values_to ="date", names_pattern = "(.*)_date") %>%
  mutate(activity = factor(activity, ordered = T,
                           levels = c("admn", "discharge")),
         admitted = ifelse(activity == "admn", 1, -1)) %>%
  group_by(patient_id) %>%
  arrange(date, sort_key, activity, .by_group = TRUE) %>%
  mutate (admitted = cumsum(admitted)) %>%
  ungroup()
> df_2
# A tibble: 44 x 8
case_id patient_id pack_id hosp_id sort_key activity date admitted
<int> <int> <int> <int> <int> <ord> <date> <dbl>
1 1 1 12 1 1 admn 2020-01-01 1
2 2 1 62 1 2 admn 2020-01-03 2
3 3 1 59 2 3 admn 2020-01-04 3
4 4 1 68 2 4 admn 2020-01-05 4
5 3 1 59 2 3 discharge 2020-01-10 3
6 1 1 12 1 1 discharge 2020-01-14 2
7 4 1 68 2 4 discharge 2020-01-14 1
8 2 1 62 1 2 discharge 2020-01-15 0
9 5 1 77 1 5 admn 2020-01-16 1
10 6 1 86 1 6 admn 2020-01-18 2
# ... with 34 more rows
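With this code, df_2 %>% filter(admitted >1 & activity == "admn"), I can filter out all the inconsistent records in one go.
However, I want to include/generate a list column in which the hospital id (hosp_id) is appended whenever activity == 'admn', that is, whenever a new record/case_id is opened without closing the previous ones, and in which the hospital id is removed from the existing entries whenever activity == 'discharge'. So basically my desired output for df_2 is something like this:
Desired output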
# A tibble: 44 x 9
case_id patient_id pack_id hosp_id sort_key activity date admitted open_records
<int> <int> <int> <int> <int> <ord> <date> <dbl> <list>
1 1 1 12 1 1 admn 2020-01-01 1 1
2 2 1 62 1 2 admn 2020-01-03 2 1, 1
3 3 1 59 2 3 admn 2020-01-04 3 1, 1, 2
4 4 1 68 2 4 admn 2020-01-05 4 1, 1, 2, 2
5 3 1 59 2 3 discharge 2020-01-10 3 1, 1, 2
6 1 1 12 1 1 discharge 2020-01-14 2 1, 2
7 4 1 68 2 4 discharge 2020-01-14 1 1
8 2 1 62 1 2 discharge 2020-01-15 0 <NULL>
9 5 1 77 1 5 admn 2020-01-16 1 1
10 6 1 86 1 6 admn 2020-01-18 2 1, 1
# ... with 34 more rows
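Note: I know that list columns are not displayed in a tibble/data.frame the way I have shown them here for explanation purposes. However, if there is any way to print them like that, I would definitely like to know.
Moreover, if there is a better strategy for storing the hospital ids in a column than generating a list column, I would also like to know.
If you don't mind using a loop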
library(dplyr)   # for lag()
library(stringi) # for stri_reverse()
df3 <- df_2
df3$open_records <- NA_character_
df3$hosp_id <- as.character(df3$hosp_id) #makes pasting easier
for(i in 1:nrow(df3)){
  #if re-admn
  if(df3$activity[i] == "admn"){
    df3$open_records[i] <- paste(lag(df3$open_records, default = "")[i],
                                 df3$hosp_id[i],
                                 sep = ",")
    #we'll handle pretty commas later
  }
  #if discharge
  if(df3$activity[i] == "discharge"){
    df3$open_records[i] <- stri_reverse(sub(stri_reverse(df3$hosp_id[i]), "",
                                            stri_reverse(df3$open_records[i-1])))
    #sub out one hospital if discharge
    #we reverse the string (and the id) so that sub() removes the last
    #matching hosp_id, then reverse the result back
  }
  #if admitted == 0
  if(df3$admitted[i] == 0) df3$open_records[i] <- NA
  #if just starting the group
  if(df3$activity[i] == "admn" & df3$admitted[i] == 1){
    df3$open_records[i] <- df3$hosp_id[i]
  }
}
#comma clean
df3$open_records <- gsub("^,*|(?<=,),|,*$", "", df3$open_records, perl=T)
df3$open_records <- gsub(",", ", ", df3$open_records)
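This may not be the best option if your dataset is very large. It may also be worth adding a next statement to each if block (if you do that, I think it makes sense to move the start-of-group if statement to the top of the loop).
(Comma clean-up source: Removing multiple commas and trailing commas using gsub)
Edit, based on not wanting to use a loop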
library(tidyverse)
paste3 <- function(out, input, activity, sep = ",") {
  if (activity == "admn") {
    paste(out, input, sep = sep)   # admission: append the hosp_id
  } else if (activity == "discharge") {
    sub(input, "", out)            # discharge: remove the first matching hosp_id
  }
}
df4 <- df_2 %>%
  mutate(temp_act = lead(activity)) %>%
  mutate(open_records = accumulate2(hosp_id, head(temp_act, -1), paste3)) %>%
  select(-temp_act)
df4$open_records <- gsub("^,*|(?<=,),|,*$", "", df4$open_records, perl=T)
df4$open_records <- gsub(",", ", ", df4$open_records)
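I noticed that a patient can be admitted to the same hospital more than once at the same time. One thing you may want to consider is concatenating case_id and hosp_id, so that when a discharge occurs you remove the entry that belongs to the correct admission rather than the first matching hosp_id. (Replace hosp_id in the code with your new variable.)
This does not come up in your sample data, but if someone had open_records of 2, 1, 2, 1, 2 and was discharged from their third admission, my code would return 1, 2, 1, 2 when you probably want 2, 1, 1, 2.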
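The idea of keying each open entry by its case_id can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code from the answer above: the `events` data frame is made-up toy data reproducing the 2, 1, 2, 1, 2 situation, and `step`, `states` and `open_txt` are hypothetical names. The running state is a named integer vector (names are open case_ids, values are hosp_ids), so a discharge drops exactly the admission it belongs to.

```r
# Toy event log for one patient (hypothetical data, not from the question):
# five admissions to hospitals 2, 1, 2, 1, 2, then a discharge of case 3.
events <- data.frame(
  case_id  = c(1L, 2L, 3L, 4L, 5L, 3L),
  hosp_id  = c(2L, 1L, 2L, 1L, 2L, 2L),
  activity = c("admn", "admn", "admn", "admn", "admn", "discharge")
)

# One accumulation step: `open` is a named integer vector whose names are
# the open case_ids and whose values are the matching hosp_ids.
step <- function(open, i) {
  key <- as.character(events$case_id[i])
  if (events$activity[i] == "admn") {
    c(open, setNames(events$hosp_id[i], key))  # append the new admission
  } else {
    open[names(open) != key]                   # drop exactly that case
  }
}

# Reduce() with accumulate = TRUE keeps every intermediate state;
# [-1] drops the empty initial state so there is one state per row.
states <- Reduce(step, seq_len(nrow(events)), accumulate = TRUE,
                 init = integer(0))[-1]

events$open_records <- lapply(states, unname)  # a true list column
events$open_txt <- vapply(events$open_records, toString, character(1))
events$open_txt[6]
# "2, 1, 1, 2"
```

Because the state is a vector rather than a pasted string, no comma clean-up is needed, and the `open_txt` column gives a printable version of the list column.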
Here is a nice tidyverse solution:
library(dplyr)
library(purrr)
df_2 %>%
  group_by(patient_id) %>%
  mutate(open_records = accumulate(2:n(), .init = paste0(hosp_id[1], ","),
                                   ~ if(activity[.y] == "admn") {
                                     paste0(.x, hosp_id[.y], ",")
                                   } else {
                                     sub(paste0(hosp_id[.y], ","), "", .x)
                                   }),
         # drop the trailing comma left by the accumulation
         open_records = gsub(",$", "", open_records))
# A tibble: 44 x 9
# Groups: patient_id [8]
case_id patient_id pack_id hosp_id sort_key activity date admitted open_records
<int> <int> <int> <int> <int> <ord> <date> <dbl> <chr>
1 1 1 12 1 1 admn 2020-01-01 1 "1"
2 2 1 62 1 2 admn 2020-01-03 2 "1,1"
3 3 1 59 2 3 admn 2020-01-04 3 "1,1,2"
4 4 1 68 2 4 admn 2020-01-05 4 "1,1,2,2"
5 3 1 59 2 3 discharge 2020-01-10 3 "1,1,2"
6 1 1 12 1 1 discharge 2020-01-14 2 "1,2"
7 4 1 68 2 4 discharge 2020-01-14 1 "1"
8 2 1 62 1 2 discharge 2020-01-15 0 ""
9 5 1 77 1 5 admn 2020-01-16 1 "1"
10 6 1 86 1 6 admn 2020-01-18 2 "1,1"
# ... with 34 more rows
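String-based removal with sub() matches substrings, so with multi-digit ids a pattern such as "1," would also match inside "11,". The base R sketch below keeps the open hospital ids as an integer vector per row (a real list column) and adds a formatted character column for printing. It is a minimal illustration under assumptions: `ev` is made-up data already ordered by patient and date, and `update_open`, `open_by_patient` and `open_txt` are hypothetical names.

```r
# Hypothetical events for two patients, already ordered by patient and date.
ev <- data.frame(
  patient_id = c(1L, 1L, 1L, 1L, 2L, 2L),
  hosp_id    = c(11L, 1L, 1L, 11L, 3L, 3L),
  activity   = c("admn", "admn", "discharge", "discharge", "admn", "discharge")
)

# Add one hosp_id to, or remove one hosp_id from, the vector of open ids.
update_open <- function(open, hosp, act) {
  if (act == "admn") return(c(open, hosp))
  pos <- match(hosp, open)            # first exact match, NA if none
  if (is.na(pos)) open else open[-pos]
}

# Run the accumulation separately for each patient so the state resets.
open_by_patient <- lapply(split(ev, ev$patient_id), function(d) {
  Reduce(function(open, i) update_open(open, d$hosp_id[i], d$activity[i]),
         seq_len(nrow(d)), accumulate = TRUE, init = integer(0))[-1]
})

# Flatten one level: one integer vector per row, in the original row order.
ev$open_records <- unlist(open_by_patient, recursive = FALSE, use.names = FALSE)

# A character column that prints the list column in a readable form.
ev$open_txt <- vapply(ev$open_records, toString, character(1))
ev$open_txt
# "11"  "11, 1"  "11"  ""  "3"  ""
```

The list column stays available for further computation (for example `lengths(ev$open_records)` gives the number of open records per row), while `open_txt` is only there for display.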