如何提取在另一家医院出院前入院的病历

How to extract records to those patients who got admitted before discharge in another hospital

我正在分析多家医院患者 admission/discharge 的各种不一致数据。

我的数据结构就像-

  1. Row_id;只不过是记录的唯一标识符(在其他一些中用作外键table)
  2. patient_id:患者的唯一标识符键
  3. pack_id : 患者选择治疗的医疗套餐
  4. hospital_id:医院的唯一标识符
  5. admn_dt : 入学日期
  6. discharge_date : 患者出院日期

数据快照

row_id  patient_id  pack_id hosp_id admn_date   discharge_date
1   1   12  1   01-01-2020  14-01-2020
2   1   62  2   03-01-2020  15-01-2020
3   1   77  1   16-01-2020  27-01-2020
4   1   86  1   18-01-2020  19-01-2020
5   1   20  2   22-01-2020  25-01-2020
6   2   55  3   01-01-2020  14-01-2020
7   2   86  3   03-01-2020  17-01-2020
8   2   72  4   16-01-2020  27-01-2020
9   1   7   1   26-01-2020  30-01-2020
10  3   54  5   14-01-2020  22-01-2020
11  3   75  5   09-02-2020  17-02-2020
12  3   26  6   22-01-2020  05-02-2020
13  4   21  7   14-04-2020  23-04-2020
14  4   12  7   23-04-2020  29-04-2020
15  5   49  8   17-03-2020  26-03-2020
16  5   35  9   27-02-2020  07-03-2020
17  6   51  10  12-04-2020  15-04-2020
18  7   31  11  11-02-2020  17-02-2020
19  8   10  12  07-03-2020  08-03-2020
20  8   54  13  20-03-2020  23-03-2020

数据样本dput如下:

df <- structure(list(row_id = c("1", "2", "3", "4", "5", "6", "7", 
"8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", 
"19", "20"), patient_id = c("1", "1", "1", "1", "1", "2", "2", 
"2", "1", "3", "3", "3", "4", "4", "5", "5", "6", "7", "8", "8"
), pack_id = c("12", "62", "77", "86", "20", "55", "86", "72", 
"7", "54", "75", "26", "21", "12", "49", "35", "51", "31", "10", 
"54"), hosp_id = c("1", "2", "1", "1", "2", "3", "3", "4", "1", 
"5", "5", "6", "7", "7", "8", "9", "10", "11", "12", "13"), admn_date = structure(c(18262, 
18264, 18277, 18279, 18283, 18262, 18264, 18277, 18287, 18275, 
18301, 18283, 18366, 18375, 18338, 18319, 18364, 18303, 18328, 
18341), class = "Date"), discharge_date = structure(c(18275, 
18276, 18288, 18280, 18286, 18275, 18278, 18288, 18291, 18283, 
18309, 18297, 18375, 18381, 18347, 18328, 18367, 18309, 18329, 
18344), class = "Date")), row.names = c(NA, -20L), class = "data.frame")

我必须确定患者入院但未出院的记录。为此,我使用了以下代码,从该线程 -

获得帮助
library(tidyverse)

df %>% arrange(patient_id, admn_date, discharge_date) %>%
  mutate(sort_key = row_number()) %>%
  pivot_longer(c(admn_date, discharge_date), names_to ="activity", 
               values_to ="date", names_pattern = "(.*)_date") %>%
  mutate(activity = factor(activity, ordered = T, 
                           levels = c("admn", "discharge")),
         admitted = ifelse(activity == "admn", 1, -1)) %>%
  group_by(patient_id) %>%
  arrange(date, sort_key, activity, .by_group = TRUE) %>% 
  mutate (admitted = cumsum(admitted)) %>%
  ungroup() %>%
  filter(admitted >1, activity == "admn")

这很好地为我提供了所有患者入院但未从之前的治疗中出院的记录。

输出-

# A tibble: 6 x 8
  row_id patient_id pack_id hosp_id sort_key activity date       admitted
  <chr>  <chr>      <chr>   <chr>      <int> <ord>    <date>        <dbl>
1 2      1          62      2              2 admn     2020-01-03        2
2 4      1          86      1              4 admn     2020-01-18        2
3 5      1          20      2              5 admn     2020-01-22        2
4 9      1          7       1              6 admn     2020-01-26        2
5 7      2          86      3              8 admn     2020-01-03        2
6 8      2          72      4              9 admn     2020-01-16        2

解释-

Row_id 2 是正确的,因为它与 row_id 1

重叠

Row_id 4 是正确的,因为它与 row_id 3

重叠

Row_id 5 是正确的,因为它与 row_id 3(再次)重叠

Row_id 9 是正确的,因为它与 row_id 3(再次)重叠

Row_id 7 是正确的,因为它与 row_id 6

重叠

Row_id 8 是正确的,因为它与 row_id 7

重叠

现在我被一个给定的验证规则所困,即 允许患者在同一家医院入院 n 次,而无需实际验证他们之前的出院情况。 换句话说,我必须只提取那些患者在另一家医院入院而没有从“另一家医院”出院的记录。如果医院相同,group_by at hosp_id 字段可以为我完成工作,但这里的情况实际上是相反的。对于相同的 hosp_id 是允许的,但对于不同的是不允许的。 请帮助我如何进行? 如果我可以将结果 row_id 与其重叠记录的 row_id 映射,也许我们可以解决问题。

期望的输出-

row_id
2
5
8

因为 row_ids 4, 9 和 7 与具有相同医院 ID 的记录重叠。

提前致谢。 P.S。虽然已经给出了所需的解决方案,但我想知道它是否可以通过 map/apply 函数组 and/or 通过 data.table 包来完成?

再次按患者id分组,然后统计医院id。然后将其重新合并并过滤数据。

类似于:

admitted_not_validated %>%
  left_join(
    admitted_not_validated %>%
    group_by(patient_id)  %>% 
    summarize (multi_hosp = length(unique(hosp_id)),.groups ='drop'), 
   by = 'patient_id') %>%
  filter(multi_hosp >1) 

这是您要找的吗? (具体可以参考代码中的注释,有需要的我可以进行说明。)

#Your data
df <- structure(list(row_id = c("1", "2", "3", "4", "5", "6", "7", 
                                "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", 
                                "19", "20"), patient_id = c("1", "1", "1", "1", "1", "2", "2", 
                                                            "2", "1", "3", "3", "3", "4", "4", "5", "5", "6", "7", "8", "8"
                                ), pack_id = c("12", "62", "77", "86", "20", "55", "86", "72", 
                                               "7", "54", "75", "26", "21", "12", "49", "35", "51", "31", "10", 
                                               "54"), hosp_id = c("1", "2", "1", "1", "2", "3", "3", "4", "1", 
                                                                  "5", "5", "6", "7", "7", "8", "9", "10", "11", "12", "13"), admn_date = structure(c(18262, 
                                                                                                                                                      18264, 18277, 18279, 18283, 18262, 18264, 18277, 18287, 18275, 
                                                                                                                                                      18301, 18283, 18366, 18375, 18338, 18319, 18364, 18303, 18328, 
                                                                                                                                                      18341), class = "Date"), discharge_date = structure(c(18275, 
                                                                                                                                                                                                            18276, 18288, 18280, 18286, 18275, 18278, 18288, 18291, 18283, 
                                                                                                                                                                                                            18309, 18297, 18375, 18381, 18347, 18328, 18367, 18309, 18329, 
                                                                                                                                                                                                            18344), class = "Date")), row.names = c(NA, -20L), class = "data.frame")
#Solution
library(dplyr)
library(tidyr)
library(stringr)
library(magrittr)
library(lubridate)


#Convert patient_id column into numeric
df$patient_id <- as.numeric(df$patient_id)

#Create empty (well, 1 row) data.frame to
#collect output data
#This needs three additional columns
#(as indicated)
outdat <- data.frame(matrix(nrow = 1, ncol = 9), stringsAsFactors = FALSE)
names(outdat) <- c(names(df), "ref_discharge_date", "ref_hosp_id", "overlap")


#Logic:
#For each unique patient_id take all
#their records.
#For each row of each such set of records
#compare its discharge_date with the admn_date
#of all other records with admn_date >= its own
#admn_date
#Then register the time interval between this row's
#discharge_date and the compared row's admn_date
#as a numeric value ("overlap")
#The idea is that concurrent hospital stays will have
#negative overlaps as the admn_date (of the current stay)
#will precede the discharge_date (of the previous one)

for(i in 1:length(unique(df$patient_id))){
  
  #i <- 7
  curdat <- df %>% filter(patient_id == unique(df$patient_id)[i])
  curdat %<>% mutate(admn_date = lubridate::as_date(admn_date), 
                     discharge_date = lubridate::as_date(discharge_date))
  
  curdat %<>% arrange(admn_date)
  
  
  for(j in 1:nrow(curdat)){
    
    #j <- 1
    currow <- curdat[j, ]
    #otrows <- curdat[-j, ]
    #
    otrows <- curdat %>% filter(admn_date >= currow$admn_date)
    #otrows <- curdat
    
    
    for(k in 1:nrow(otrows)){
      
      otrows$ref_discharge_date[k] <- currow$discharge_date
      #otrows$refdisc[k] <- as_date(otrows$refdisc[k])
      otrows$ref_hosp_id[k] <- currow$hosp_id
      otrows$overlap[k] <- as.numeric(difftime(otrows$admn_date[k], currow$discharge_date))
      
    }
    otrows$ref_discharge_date <- as_date(otrows$ref_discharge_date)
    
    outdat <- bind_rows(outdat, otrows)
  }
  
}

rm(curdat, i, j, k, otrows, currow)


#Removing that NA row + removing all self-rows
outdat %<>% 
  filter(!is.na(patient_id)) %>%
  filter(discharge_date != ref_discharge_date)


#Filter out only negative overlaps
outdat %<>% filter(overlap < 0)

#Filter out only those records where the patient
#was admitted to different hospitals
outdat %<>% filter(hosp_id != ref_hosp_id)


outdat
#  row_id patient_id pack_id hosp_id  admn_date discharge_date ref_discharge_date ref_hosp_id overlap
# 1      2          1      62       2 2020-01-03     2020-01-15         2020-01-14           1     -11
# 2      5          1      20       2 2020-01-22     2020-01-25         2020-01-27           1      -5
# 3      8          2      72       4 2020-01-16     2020-01-27         2020-01-17           3      -1