如何提取在另一家医院出院前入院的病历
How to extract records to those patients who got admitted before discharge in another hospital
我正在分析多家医院患者 admission/discharge 的各种不一致数据。
我的数据结构就像-
- Row_id;只不过是记录的唯一标识符(在其他一些中用作外键table)
- patient_id:患者的唯一标识符键
- pack_id : 患者选择治疗的医疗套餐
- hospital_id:医院的唯一标识符
- admn_dt : 入学日期
- discharge_date : 患者出院日期
数据快照
row_id patient_id pack_id hosp_id admn_date discharge_date
1 1 12 1 01-01-2020 14-01-2020
2 1 62 2 03-01-2020 15-01-2020
3 1 77 1 16-01-2020 27-01-2020
4 1 86 1 18-01-2020 19-01-2020
5 1 20 2 22-01-2020 25-01-2020
6 2 55 3 01-01-2020 14-01-2020
7 2 86 3 03-01-2020 17-01-2020
8 2 72 4 16-01-2020 27-01-2020
9 1 7 1 26-01-2020 30-01-2020
10 3 54 5 14-01-2020 22-01-2020
11 3 75 5 09-02-2020 17-02-2020
12 3 26 6 22-01-2020 05-02-2020
13 4 21 7 14-04-2020 23-04-2020
14 4 12 7 23-04-2020 29-04-2020
15 5 49 8 17-03-2020 26-03-2020
16 5 35 9 27-02-2020 07-03-2020
17 6 51 10 12-04-2020 15-04-2020
18 7 31 11 11-02-2020 17-02-2020
19 8 10 12 07-03-2020 08-03-2020
20 8 54 13 20-03-2020 23-03-2020
数据样本dput
如下:
df <- structure(list(row_id = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18",
"19", "20"), patient_id = c("1", "1", "1", "1", "1", "2", "2",
"2", "1", "3", "3", "3", "4", "4", "5", "5", "6", "7", "8", "8"
), pack_id = c("12", "62", "77", "86", "20", "55", "86", "72",
"7", "54", "75", "26", "21", "12", "49", "35", "51", "31", "10",
"54"), hosp_id = c("1", "2", "1", "1", "2", "3", "3", "4", "1",
"5", "5", "6", "7", "7", "8", "9", "10", "11", "12", "13"), admn_date = structure(c(18262,
18264, 18277, 18279, 18283, 18262, 18264, 18277, 18287, 18275,
18301, 18283, 18366, 18375, 18338, 18319, 18364, 18303, 18328,
18341), class = "Date"), discharge_date = structure(c(18275,
18276, 18288, 18280, 18286, 18275, 18278, 18288, 18291, 18283,
18309, 18297, 18375, 18381, 18347, 18328, 18367, 18309, 18329,
18344), class = "Date")), row.names = c(NA, -20L), class = "data.frame")
我必须确定患者入院但未出院的记录。为此,我使用了以下代码,从该线程 -
获得帮助
library(tidyverse)
df %>% arrange(patient_id, admn_date, discharge_date) %>%
mutate(sort_key = row_number()) %>%
pivot_longer(c(admn_date, discharge_date), names_to ="activity",
values_to ="date", names_pattern = "(.*)_date") %>%
mutate(activity = factor(activity, ordered = T,
levels = c("admn", "discharge")),
admitted = ifelse(activity == "admn", 1, -1)) %>%
group_by(patient_id) %>%
arrange(date, sort_key, activity, .by_group = TRUE) %>%
mutate (admitted = cumsum(admitted)) %>%
ungroup() %>%
filter(admitted >1, activity == "admn")
这很好地为我提供了所有患者入院但未从之前的治疗中出院的记录。
输出-
# A tibble: 6 x 8
row_id patient_id pack_id hosp_id sort_key activity date admitted
<chr> <chr> <chr> <chr> <int> <ord> <date> <dbl>
1 2 1 62 2 2 admn 2020-01-03 2
2 4 1 86 1 4 admn 2020-01-18 2
3 5 1 20 2 5 admn 2020-01-22 2
4 9 1 7 1 6 admn 2020-01-26 2
5 7 2 86 3 8 admn 2020-01-03 2
6 8 2 72 4 9 admn 2020-01-16 2
解释-
Row_id 2 是正确的,因为它与 row_id 1
重叠
Row_id 4 是正确的,因为它与 row_id 3
重叠
Row_id 5 是正确的,因为它与 row_id 3(再次)重叠
Row_id 9 是正确的,因为它与 row_id 3(再次)重叠
Row_id 7 是正确的,因为它与 row_id 6
重叠
Row_id 8 是正确的,因为它与 row_id 7
重叠
现在我被一个给定的验证规则所困,即 允许患者在同一家医院入院 n 次,而无需实际验证他们之前的出院情况。 换句话说,我必须只提取那些患者在另一家医院入院而没有从“另一家医院”出院的记录。如果医院相同,group_by
at hosp_id
字段可以为我完成工作,但这里的情况实际上是相反的。对于相同的 hosp_id
是允许的,但对于不同的是不允许的。
请帮助我如何进行?
如果我可以将结果 row_id 与其重叠记录的 row_id 映射,也许我们可以解决问题。
期望的输出-
row_id
2
5
8
因为 row_ids 4, 9 和 7 与具有相同医院 ID 的记录重叠。
提前致谢。
P.S。虽然已经给出了所需的解决方案,但我想知道它是否可以通过 map
/apply
函数组 and/or 通过 data.table 包来完成?
再次按患者id分组,然后统计医院id。然后将其重新合并并过滤数据。
类似于:
admitted_not_validated %>%
left_join(
admitted_not_validated %>%
group_by(patient_id) %>%
summarize (multi_hosp = length(unique(hosp_id)),.groups ='drop'),
by = 'patient_id') %>%
filter(multi_hosp >1)
这是您要找的吗? (具体可以参考代码中的注释,有需要的我可以进行说明。)
#Your data
df <- structure(list(row_id = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18",
"19", "20"), patient_id = c("1", "1", "1", "1", "1", "2", "2",
"2", "1", "3", "3", "3", "4", "4", "5", "5", "6", "7", "8", "8"
), pack_id = c("12", "62", "77", "86", "20", "55", "86", "72",
"7", "54", "75", "26", "21", "12", "49", "35", "51", "31", "10",
"54"), hosp_id = c("1", "2", "1", "1", "2", "3", "3", "4", "1",
"5", "5", "6", "7", "7", "8", "9", "10", "11", "12", "13"), admn_date = structure(c(18262,
18264, 18277, 18279, 18283, 18262, 18264, 18277, 18287, 18275,
18301, 18283, 18366, 18375, 18338, 18319, 18364, 18303, 18328,
18341), class = "Date"), discharge_date = structure(c(18275,
18276, 18288, 18280, 18286, 18275, 18278, 18288, 18291, 18283,
18309, 18297, 18375, 18381, 18347, 18328, 18367, 18309, 18329,
18344), class = "Date")), row.names = c(NA, -20L), class = "data.frame")
#Solution
library(dplyr)
library(tidyr)
library(stringr)
library(magrittr)
library(lubridate)
#Convert patient_id column into numeric
df$patient_id <- as.numeric(df$patient_id)
#Create empty (well, 1 row) data.frame to
#collect output data
#This needs three additional columns
#(as indicated)
outdat <- data.frame(matrix(nrow = 1, ncol = 9), stringsAsFactors = FALSE)
names(outdat) <- c(names(df), "ref_discharge_date", "ref_hosp_id", "overlap")
#Logic:
#For each unique patient_id take all
#their records.
#For each row of each such set of records
#compare its discharge_date with the admn_date
#of all other records with admn_date >= its own
#admn_date
#Then register the time interval between this row's
#discharge_date and the compared row's admn_date
#as a numeric value ("overlap")
#The idea is that concurrent hospital stays will have
#negative overlaps as the admn_date (of the current stay)
#will precede the discharge_date (of the previous one)
for(i in 1:length(unique(df$patient_id))){
#i <- 7
curdat <- df %>% filter(patient_id == unique(df$patient_id)[i])
curdat %<>% mutate(admn_date = lubridate::as_date(admn_date),
discharge_date = lubridate::as_date(discharge_date))
curdat %<>% arrange(admn_date)
for(j in 1:nrow(curdat)){
#j <- 1
currow <- curdat[j, ]
#otrows <- curdat[-j, ]
#
otrows <- curdat %>% filter(admn_date >= currow$admn_date)
#otrows <- curdat
for(k in 1:nrow(otrows)){
otrows$ref_discharge_date[k] <- currow$discharge_date
#otrows$refdisc[k] <- as_date(otrows$refdisc[k])
otrows$ref_hosp_id[k] <- currow$hosp_id
otrows$overlap[k] <- as.numeric(difftime(otrows$admn_date[k], currow$discharge_date))
}
otrows$ref_discharge_date <- as_date(otrows$ref_discharge_date)
outdat <- bind_rows(outdat, otrows)
}
}
rm(curdat, i, j, k, otrows, currow)
#Removing that NA row + removing all self-rows
outdat %<>%
filter(!is.na(patient_id)) %>%
filter(discharge_date != ref_discharge_date)
#Filter out only negative overlaps
outdat %<>% filter(overlap < 0)
#Filter out only those records where the patient
#was admitted to different hospitals
outdat %<>% filter(hosp_id != ref_hosp_id)
outdat
# row_id patient_id pack_id hosp_id admn_date discharge_date ref_discharge_date ref_hosp_id overlap
# 1 2 1 62 2 2020-01-03 2020-01-15 2020-01-14 1 -11
# 2 5 1 20 2 2020-01-22 2020-01-25 2020-01-27 1 -5
# 3 8 2 72 4 2020-01-16 2020-01-27 2020-01-17 3 -1
我正在分析多家医院患者 admission/discharge 的各种不一致数据。
我的数据结构就像-
- Row_id;只不过是记录的唯一标识符(在其他一些中用作外键table)
- patient_id:患者的唯一标识符键
- pack_id : 患者选择治疗的医疗套餐
- hospital_id:医院的唯一标识符
- admn_dt : 入学日期
- discharge_date : 患者出院日期
数据快照
row_id patient_id pack_id hosp_id admn_date discharge_date
1 1 12 1 01-01-2020 14-01-2020
2 1 62 2 03-01-2020 15-01-2020
3 1 77 1 16-01-2020 27-01-2020
4 1 86 1 18-01-2020 19-01-2020
5 1 20 2 22-01-2020 25-01-2020
6 2 55 3 01-01-2020 14-01-2020
7 2 86 3 03-01-2020 17-01-2020
8 2 72 4 16-01-2020 27-01-2020
9 1 7 1 26-01-2020 30-01-2020
10 3 54 5 14-01-2020 22-01-2020
11 3 75 5 09-02-2020 17-02-2020
12 3 26 6 22-01-2020 05-02-2020
13 4 21 7 14-04-2020 23-04-2020
14 4 12 7 23-04-2020 29-04-2020
15 5 49 8 17-03-2020 26-03-2020
16 5 35 9 27-02-2020 07-03-2020
17 6 51 10 12-04-2020 15-04-2020
18 7 31 11 11-02-2020 17-02-2020
19 8 10 12 07-03-2020 08-03-2020
20 8 54 13 20-03-2020 23-03-2020
数据样本dput
如下:
df <- structure(list(row_id = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18",
"19", "20"), patient_id = c("1", "1", "1", "1", "1", "2", "2",
"2", "1", "3", "3", "3", "4", "4", "5", "5", "6", "7", "8", "8"
), pack_id = c("12", "62", "77", "86", "20", "55", "86", "72",
"7", "54", "75", "26", "21", "12", "49", "35", "51", "31", "10",
"54"), hosp_id = c("1", "2", "1", "1", "2", "3", "3", "4", "1",
"5", "5", "6", "7", "7", "8", "9", "10", "11", "12", "13"), admn_date = structure(c(18262,
18264, 18277, 18279, 18283, 18262, 18264, 18277, 18287, 18275,
18301, 18283, 18366, 18375, 18338, 18319, 18364, 18303, 18328,
18341), class = "Date"), discharge_date = structure(c(18275,
18276, 18288, 18280, 18286, 18275, 18278, 18288, 18291, 18283,
18309, 18297, 18375, 18381, 18347, 18328, 18367, 18309, 18329,
18344), class = "Date")), row.names = c(NA, -20L), class = "data.frame")
我必须确定患者入院但未出院的记录。为此,我使用了以下代码,从该线程
library(tidyverse)
df %>% arrange(patient_id, admn_date, discharge_date) %>%
mutate(sort_key = row_number()) %>%
pivot_longer(c(admn_date, discharge_date), names_to ="activity",
values_to ="date", names_pattern = "(.*)_date") %>%
mutate(activity = factor(activity, ordered = T,
levels = c("admn", "discharge")),
admitted = ifelse(activity == "admn", 1, -1)) %>%
group_by(patient_id) %>%
arrange(date, sort_key, activity, .by_group = TRUE) %>%
mutate (admitted = cumsum(admitted)) %>%
ungroup() %>%
filter(admitted >1, activity == "admn")
这很好地为我提供了所有患者入院但未从之前的治疗中出院的记录。
输出-
# A tibble: 6 x 8
row_id patient_id pack_id hosp_id sort_key activity date admitted
<chr> <chr> <chr> <chr> <int> <ord> <date> <dbl>
1 2 1 62 2 2 admn 2020-01-03 2
2 4 1 86 1 4 admn 2020-01-18 2
3 5 1 20 2 5 admn 2020-01-22 2
4 9 1 7 1 6 admn 2020-01-26 2
5 7 2 86 3 8 admn 2020-01-03 2
6 8 2 72 4 9 admn 2020-01-16 2
解释-
Row_id 2 是正确的,因为它与 row_id 1
重叠Row_id 4 是正确的,因为它与 row_id 3
重叠Row_id 5 是正确的,因为它与 row_id 3(再次)重叠
Row_id 9 是正确的,因为它与 row_id 3(再次)重叠
Row_id 7 是正确的,因为它与 row_id 6
重叠Row_id 8 是正确的,因为它与 row_id 7
重叠现在我被一个给定的验证规则所困,即 允许患者在同一家医院入院 n 次,而无需实际验证他们之前的出院情况。 换句话说,我必须只提取那些患者在另一家医院入院而没有从“另一家医院”出院的记录。如果医院相同,group_by
at hosp_id
字段可以为我完成工作,但这里的情况实际上是相反的。对于相同的 hosp_id
是允许的,但对于不同的是不允许的。
请帮助我如何进行?
如果我可以将结果 row_id 与其重叠记录的 row_id 映射,也许我们可以解决问题。
期望的输出-
row_id
2
5
8
因为 row_ids 4, 9 和 7 与具有相同医院 ID 的记录重叠。
提前致谢。
P.S。虽然已经给出了所需的解决方案,但我想知道它是否可以通过 map
/apply
函数组 and/or 通过 data.table 包来完成?
再次按患者id分组,然后统计医院id。然后将其重新合并并过滤数据。
类似于:
admitted_not_validated %>%
left_join(
admitted_not_validated %>%
group_by(patient_id) %>%
summarize (multi_hosp = length(unique(hosp_id)),.groups ='drop'),
by = 'patient_id') %>%
filter(multi_hosp >1)
这是您要找的吗? (具体可以参考代码中的注释,有需要的我可以进行说明。)
#Your data
df <- structure(list(row_id = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18",
"19", "20"), patient_id = c("1", "1", "1", "1", "1", "2", "2",
"2", "1", "3", "3", "3", "4", "4", "5", "5", "6", "7", "8", "8"
), pack_id = c("12", "62", "77", "86", "20", "55", "86", "72",
"7", "54", "75", "26", "21", "12", "49", "35", "51", "31", "10",
"54"), hosp_id = c("1", "2", "1", "1", "2", "3", "3", "4", "1",
"5", "5", "6", "7", "7", "8", "9", "10", "11", "12", "13"), admn_date = structure(c(18262,
18264, 18277, 18279, 18283, 18262, 18264, 18277, 18287, 18275,
18301, 18283, 18366, 18375, 18338, 18319, 18364, 18303, 18328,
18341), class = "Date"), discharge_date = structure(c(18275,
18276, 18288, 18280, 18286, 18275, 18278, 18288, 18291, 18283,
18309, 18297, 18375, 18381, 18347, 18328, 18367, 18309, 18329,
18344), class = "Date")), row.names = c(NA, -20L), class = "data.frame")
#Solution
library(dplyr)
library(tidyr)
library(stringr)
library(magrittr)
library(lubridate)
#Convert patient_id column into numeric
df$patient_id <- as.numeric(df$patient_id)
#Create empty (well, 1 row) data.frame to
#collect output data
#This needs three additional columns
#(as indicated)
outdat <- data.frame(matrix(nrow = 1, ncol = 9), stringsAsFactors = FALSE)
names(outdat) <- c(names(df), "ref_discharge_date", "ref_hosp_id", "overlap")
#Logic:
#For each unique patient_id take all
#their records.
#For each row of each such set of records
#compare its discharge_date with the admn_date
#of all other records with admn_date >= its own
#admn_date
#Then register the time interval between this row's
#discharge_date and the compared row's admn_date
#as a numeric value ("overlap")
#The idea is that concurrent hospital stays will have
#negative overlaps as the admn_date (of the current stay)
#will precede the discharge_date (of the previous one)
for(i in 1:length(unique(df$patient_id))){
#i <- 7
curdat <- df %>% filter(patient_id == unique(df$patient_id)[i])
curdat %<>% mutate(admn_date = lubridate::as_date(admn_date),
discharge_date = lubridate::as_date(discharge_date))
curdat %<>% arrange(admn_date)
for(j in 1:nrow(curdat)){
#j <- 1
currow <- curdat[j, ]
#otrows <- curdat[-j, ]
#
otrows <- curdat %>% filter(admn_date >= currow$admn_date)
#otrows <- curdat
for(k in 1:nrow(otrows)){
otrows$ref_discharge_date[k] <- currow$discharge_date
#otrows$refdisc[k] <- as_date(otrows$refdisc[k])
otrows$ref_hosp_id[k] <- currow$hosp_id
otrows$overlap[k] <- as.numeric(difftime(otrows$admn_date[k], currow$discharge_date))
}
otrows$ref_discharge_date <- as_date(otrows$ref_discharge_date)
outdat <- bind_rows(outdat, otrows)
}
}
rm(curdat, i, j, k, otrows, currow)
#Removing that NA row + removing all self-rows
outdat %<>%
filter(!is.na(patient_id)) %>%
filter(discharge_date != ref_discharge_date)
#Filter out only negative overlaps
outdat %<>% filter(overlap < 0)
#Filter out only those records where the patient
#was admitted to different hospitals
outdat %<>% filter(hosp_id != ref_hosp_id)
outdat
# row_id patient_id pack_id hosp_id admn_date discharge_date ref_discharge_date ref_hosp_id overlap
# 1 2 1 62 2 2020-01-03 2020-01-15 2020-01-14 1 -11
# 2 5 1 20 2 2020-01-22 2020-01-25 2020-01-27 1 -5
# 3 8 2 72 4 2020-01-16 2020-01-27 2020-01-17 3 -1