R: filter a variable by the count of another variable, counted only within a maximum day interval

This is the data frame I'm working with:

library(tidyverse)

df <- tribble(
  ~Patient, ~date, ~Doctor,
  "A", "2020-01-01", "A",
  "A", "2020-03-01", "A",
  "A", "2020-04-30", "B",
  "A", "2020-06-29", "C",
  "A", "2020-08-28", "A",
  "B", "2020-01-01", "A",
  "B", "2020-03-01","B",
  "B", "2020-04-30","B",
  "B", "2020-06-29","B",
  "B", "2020-08-28","C",
  "C", "2020-04-30","A",
  "C", "2020-06-29","A",
  "C", "2020-08-28","B",
  "C", "2020-10-27","C",
  "C", "2020-12-26","A",
)

As you can see, there are three columns: Patient, date, and Doctor.

This is the desired data frame I'm trying to produce:

desired_df <- tribble(
  ~Patient, ~Number_of_Diff_Doctors_within_180_days, 
  "A", "3", 
  "B", "2", 
  "C", "3", 
)

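The logic is as follows: I am trying to return a data frame with one row per patient, together with the number of doctors that patient saw within a 180-day window. The 180-day period works like a moving window, and the task is to find the maximum number of doctors a patient saw in any 180-day window.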

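In the example, patient A saw 3 different doctors (A, B, and C) between 2020-03-01 and 2020-06-29, a span of less than 180 days, so this patient gets a value of 3, corresponding to the three doctors. Patient B also has three doctors in total, but saw doctor A on 2020-01-01 and doctor C on 2020-08-28, so there are only two doctors in any 180-day window. Patient C has the same visit intervals as patient A, except that the dates are shifted forward.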

Here is my attempt so far. It does nothing with the date logic, because I don't know what I'm doing.

attempt <- df %>%
  dplyr::select(Patient, Doctor) %>%
  dplyr::group_by(Patient, Doctor) %>%
  distinct() %>%
  dplyr::group_by(Patient) %>%
  tally() %>%
  filter(n > 1)

Solution updated based on the OP's edit.

First, let's build a tidy data frame that includes the cumulative number of days across each patient's visits:

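# Assumes rows are already sorted by date within each patient.
df2 <- df %>% 
  mutate(date = as.Date(date)) %>% 
  group_by(Patient) %>% 
  # days since the previous visit (0 for the first visit), then a running total
  mutate(days_btwn = replace_na(as.numeric(date - lag(date)), 0),
         cum_days = cumsum(days_btwn)) %>% 
  ungroup()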

Sample df2 output:

# A tibble: 15 × 5
   Patient date       Doctor days_btwn cum_days
   <chr>   <date>     <chr>      <dbl>    <dbl>
 1 A       2020-01-01 A              0        0
 2 A       2020-03-01 A             60       60
 3 A       2020-04-30 B             60      120
 4 A       2020-06-29 C             60      180
 5 A       2020-08-28 A             60      240
 6 B       2020-01-01 A              0        0
#...

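Next, we can loop over each Patient (essentially a group-by operation) and iteratively sample rolling windows of visits. For each window spanning <= 180 days in total, count the unique Doctor values, keep the maximum, and combine the results for all patients into a single data frame.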


# For each patient, scan every pair of visits (i, j) and keep the largest
# number of distinct doctors seen in any span of at most 180 days.
unique(df2$Patient) %>% 
  map_dfr(function(pat) {
    this_pat <- df2 %>% filter(Patient == pat)
    n_obs <- nrow(this_pat)
    max_docs <- n_distinct(this_pat$Doctor)  # upper bound for this patient
    max_win_docs <- 0L
    for (i in 1:n_obs) {
      for (j in 1:n_obs) {
        win_days <- abs(this_pat$cum_days[j] - this_pat$cum_days[i])
        if (win_days <= 180) {
          # distinct doctors across visits i..j
          n_docs <- n_distinct(this_pat$Doctor[i:j])
          if (n_docs > max_win_docs) max_win_docs <- n_docs
          # upper bound reached: no later j can improve on it
          if (max_win_docs == max_docs) break
        }
      }
    }
    list(patient = pat, n_diff_docs_within_180 = max_win_docs)
  }
)

Output:

# A tibble: 3 × 2
  patient n_diff_docs_within_180
  <chr>                    <int>
1 A                            3
2 B                            2
3 C                            3
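
The same idea can be sketched in base R without the nested loop. This is only an illustration: `max_docs_in_window` is a hypothetical helper, not a function from any package, and the `visits` data frame simply restates the question's data. It relies on the observation that a window holding the maximum number of doctors can always be slid forward until it starts on a visit, so it is enough to check the windows that begin at each visit.

```r
# Sample data restated from the question (dates stored as Date).
visits <- data.frame(
  Patient = rep(c("A", "B", "C"), each = 5),
  date = as.Date(c(
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-04-30", "2020-06-29", "2020-08-28", "2020-10-27", "2020-12-26"
  )),
  Doctor = c("A", "A", "B", "C", "A",
             "A", "B", "B", "B", "C",
             "A", "A", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each visit, count the distinct doctors seen from
# that visit up to `width` days later (inclusive), then return the maximum.
max_docs_in_window <- function(dates, doctors, width = 180) {
  counts <- vapply(seq_along(dates), function(i) {
    gap <- as.numeric(dates - dates[i])     # days elapsed since visit i
    in_window <- gap >= 0 & gap <= width    # visits inside the forward window
    length(unique(doctors[in_window]))
  }, integer(1))
  max(counts)
}

# Apply the helper to each patient's visits.
per_patient <- sapply(split(visits, visits$Patient), function(p) {
  max_docs_in_window(p$date, p$Doctor)
})
print(per_patient)
```

For the sample data this gives 3, 2, and 3 for patients A, B, and C, matching the output above.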

What you mean by "within 180 days" is a bit ambiguous. Within 180 days of what date?

This determines, for each patient, the number of distinct doctors seen within the 180 days following each visit.

library(data.table)
setDT(df)[, date:=as.Date(date)]
df[, date.hi:=date+180]
result <- df[df, on=.(Patient, date>=date, date<=date.hi)]
result[, .(count=uniqueN(Doctor)), by=.(Patient, date)]
##     Patient       date count
##  1:       A 2020-01-01     3
##  2:       A 2020-03-01     3
##  3:       A 2020-04-30     3
##  4:       A 2020-06-29     2
##  5:       A 2020-08-28     1
##  6:       B 2020-01-01     2
##  7:       B 2020-03-01     2
##  8:       B 2020-04-30     2
##  9:       B 2020-06-29     2
## 10:       B 2020-08-28     1
## 11:       C 2020-04-30     3
## 12:       C 2020-06-29     3
## 13:       C 2020-08-28     3
## 14:       C 2020-10-27     2
## 15:       C 2020-12-26     1

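So patient A saw 3 doctors within 180 days of 2020-01-01 (row 1), but only 2 doctors within 180 days of 2020-06-29 (row 4). Obviously, if the dataset ends less than 180 days after a given date, we can't really know how many visits will occur in that time frame.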
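
As a quick sanity check, row 4 can be reproduced by hand in base R. This is a minimal sketch that restates patient A's visits from the question; the variable names are illustrative only.

```r
# Patient A's visits, restated from the question.
dates   <- as.Date(c("2020-01-01", "2020-03-01", "2020-04-30",
                     "2020-06-29", "2020-08-28"))
doctors <- c("A", "A", "B", "C", "A")

# Forward window: from the visit on 2020-06-29 up to 180 days later.
start     <- as.Date("2020-06-29")
in_window <- dates >= start & dates <= start + 180

seen   <- unique(doctors[in_window])  # doctors seen inside the window
n_seen <- length(seen)
print(seen)
print(n_seen)
```

Only the visits on 2020-06-29 (doctor C) and 2020-08-28 (doctor A) fall inside the window, which gives the count of 2 shown in row 4.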

The expected result in your question appears to be based on each patient's first visit. We can extract that as follows:

result[, .(count=uniqueN(Doctor)), by=.(Patient, date)][, .SD[1], by=.(Patient)]
##    Patient       date count
## 1:       A 2020-01-01     3
## 2:       B 2020-01-01     2
## 3:       C 2020-04-30     3

Edit: based on the OP's comment. The maximum count for each patient is given by
result[, .(count=uniqueN(Doctor)), by=.(Patient, date)][
       , .(maxCount=max(count)),   by=.(Patient)]
##    Patient maxCount
## 1:       A        3
## 2:       B        2
## 3:       C        3

Use the runner package for rolling-window calculations like this. It's great.

library(tidyverse)
library(lubridate)
library(runner)


df <- tribble(
    ~Patient, ~date, ~Doctor,
    "A", "2020-01-01", "A",
    "A", "2020-03-01", "A",
    "A", "2020-04-30", "B",
    "A", "2020-06-29", "C",
    "A", "2020-08-28", "A",
    "B", "2020-01-01", "A",
    "B", "2020-03-01","B",
    "B", "2020-04-30","B",
    "B", "2020-06-29","B",
    "B", "2020-08-28","C",
    "C", "2020-04-30","A",
    "C", "2020-06-29","A",
    "C", "2020-08-28","B",
    "C", "2020-10-27","C",
    "C", "2020-12-26","A",
) %>% 
    mutate(date = ymd(date))

df %>% 
    group_by(Patient) %>% 
    mutate(num_docs = runner(Doctor, n_distinct, k = 180, idx = date)) %>% 
    summarize(num_docs = max(num_docs))

# A tibble: 3 × 2
  Patient num_docs
  <chr>      <int>
1 A              3
2 B              2
3 C              3
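
For a cross-check that does not need the package, here is a base R sketch of a trailing window. It assumes the window covers the 180 calendar days ending on each visit; that is an assumption about how `k` and `idx` are interpreted, so check the runner documentation for the exact convention. `trailing_max_docs` is a hypothetical helper written only for this illustration.

```r
# Sample data restated from the question (dates stored as Date).
visits <- data.frame(
  Patient = rep(c("A", "B", "C"), each = 5),
  date = as.Date(c(
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-04-30", "2020-06-29", "2020-08-28", "2020-10-27", "2020-12-26"
  )),
  Doctor = c("A", "A", "B", "C", "A",
             "A", "B", "B", "B", "C",
             "A", "A", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each visit, count the distinct doctors in the
# k calendar days ending on that visit (assumed convention), then take the max.
trailing_max_docs <- function(dates, doctors, k = 180) {
  counts <- vapply(seq_along(dates), function(i) {
    in_window <- dates > dates[i] - k & dates <= dates[i]
    length(unique(doctors[in_window]))
  }, integer(1))
  max(counts)
}

# Apply the helper to each patient's visits.
trailing <- sapply(split(visits, visits$Patient), function(p) {
  trailing_max_docs(p$date, p$Doctor)
})
print(trailing)
```

Under this convention two visits exactly 180 days apart do not share a window, unlike the `<= 180` comparisons used in the other answers. For this sample data the per-patient maxima are still 3, 2, and 3.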