R: filter a variable by the count of another variable, but only counted within a day interval

Question

This is the data frame I am working with:

library(tidyverse)

df <- tribble(
  ~Patient, ~date, ~Doctor,
  "A", "2020-01-01", "A",
  "A", "2020-03-01", "A",
  "A", "2020-04-30", "B",
  "A", "2020-06-29", "C",
  "A", "2020-08-28", "A",
  "B", "2020-01-01", "A",
  "B", "2020-03-01","B",
  "B", "2020-04-30","B",
  "B", "2020-06-29","B",
  "B", "2020-08-28","C",
  "C", "2020-04-30","A",
  "C", "2020-06-29","A",
  "C", "2020-08-28","B",
  "C", "2020-10-27","C",
  "C", "2020-12-26","A",
)

As you can see, there are three columns: Patient, date, and Doctor.

This is the desired data frame I am trying to produce:

desired_df <- tribble(
  ~Patient, ~Number_of_Diff_Doctors_within_180_days, 
  "A", "3", 
  "B", "2", 
  "C", "3", 
)

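The logic is as follows: I am trying to return a data frame with one row per patient, together with the number of different doctors that patient saw within a 180-day window. The 180-day period acts as a moving window, and the task is to work out the maximum number of doctors a patient saw during any 180-day window.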

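In the example, Patient A saw 3 different doctors (Doctors A, B, and C) between 2020-03-01 and 2020-06-29, which is within a 180-day window, so this patient gets a count of 3. Patient B also saw three doctors overall, but saw Doctor A on 2020-01-01 and Doctor C on 2020-08-28, so there are only two doctors within any 180-day window. Patient C has the same visit intervals as Patient A, except that the dates are shifted forward.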
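The day counts behind these examples can be checked with plain date arithmetic. This is only a small base R sketch; the variable names are made up for illustration, and the dates are copied from the example data.

```r
# Visit dates taken from the example data
a_first <- as.Date("2020-03-01")  # Patient A, start of the 3-doctor stretch
a_last  <- as.Date("2020-06-29")  # Patient A, visit to Doctor C
b_first <- as.Date("2020-01-01")  # Patient B, visit to Doctor A
b_last  <- as.Date("2020-08-28")  # Patient B, visit to Doctor C

# Subtracting Date objects gives a difftime in days; convert to plain numbers
span_a <- as.numeric(a_last - a_first)
span_b <- as.numeric(b_last - b_first)

span_a         # 120
span_b         # 240
span_a <= 180  # TRUE: all three doctors fit in one window
span_b <= 180  # FALSE: Doctors A and C can never share a window
```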

Here is my attempt so far. It does nothing about the date logic, because I don't know what I'm doing.

attempt <- df %>%
  dplyr::select(Patient, Doctor) %>%
  dplyr::group_by(Patient, Doctor) %>%
  distinct() %>%
  dplyr::group_by(Patient) %>%
  tally() %>%
  filter(n > 1)

Answer 1

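Solution updated based on OP's edit.

First, let's build a tidy data frame that includes the cumulative number of days across each patient's visits: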

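df2 <- df %>% 
  mutate(date = as.Date(date)) %>% 
  group_by(Patient) %>% 
  # days since the patient's previous visit (0 for the first visit);
  # assumes rows are already in date order within each patient
  mutate(days_btwn = replace_na(as.numeric(date - lag(date)), 0),
         cum_days = cumsum(days_btwn)) %>% 
  ungroup()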

Example df2 output:

# A tibble: 15 × 5
   Patient date       Doctor days_btwn cum_days
   <chr>   <date>     <chr>      <dbl>    <dbl>
 1 A       2020-01-01 A              0        0
 2 A       2020-03-01 A             60       60
 3 A       2020-04-30 B             60      120
 4 A       2020-06-29 C             60      180
 5 A       2020-08-28 A             60      240
 6 B       2020-01-01 A              0        0
#...

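Next, we can loop over each Patient (essentially a group-by operation) and iteratively sample rolling windows of visit periods. For every window whose total span is <= 180 days, count the unique Doctor values, keep the maximum, and combine the results for all patients into one data frame.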


unique(df2$Patient) %>% 
  map_dfr(function(pat) {
    this_pat <- df2 %>% filter(Patient == pat)
    n_obs <- nrow(this_pat)
    max_win_docs <- 0
    # Treat every pair of visits (i, j) with i <= j as a candidate window
    for (i in seq_len(n_obs)) {
      for (j in i:n_obs) {
        win_days <- this_pat$cum_days[j] - this_pat$cum_days[i]
        # Visits are in date order, so a later j can only widen the window
        if (win_days > 180) break
        n_docs <- n_distinct(this_pat$Doctor[i:j])
        if (n_docs > max_win_docs) max_win_docs <- n_docs
      }
    }
    list(patient = pat, n_diff_docs_within_180 = max_win_docs)
  }
)

Output:

# A tibble: 3 × 2
  patient n_diff_docs_within_180
  <chr>                    <int>
1 A                            3
2 B                            2
3 C                            3
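As a cross-check, the same window idea can be sketched in base R without the cumulative-day column. This is not the code above, just a minimal alternative: the helper name `max_docs_in_window` and the `visits` data frame are hypothetical, and the sketch relies on the observation that a best window can always be shifted to start on a visit date, so only windows starting at a visit need to be tried.

```r
# Hypothetical helper: for each visit, count distinct doctors seen from that
# visit up to `width` days later (both ends included), then take the maximum.
max_docs_in_window <- function(dates, doctors, width = 180) {
  counts <- vapply(seq_along(dates), function(i) {
    in_window <- dates >= dates[i] & dates <= dates[i] + width
    length(unique(doctors[in_window]))
  }, integer(1))
  max(counts)
}

# The example data from the question, rebuilt as a base R data frame
visits <- data.frame(
  Patient = rep(c("A", "B", "C"), each = 5),
  date = as.Date(c(
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-04-30", "2020-06-29", "2020-08-28", "2020-10-27", "2020-12-26"
  )),
  Doctor = c("A", "A", "B", "C", "A",
             "A", "B", "B", "B", "C",
             "A", "A", "B", "C", "A")
)

# Apply the helper to each patient's visits
by_patient <- split(visits, visits$Patient)
result <- vapply(
  by_patient,
  function(p) max_docs_in_window(p$date, p$Doctor),
  integer(1)
)
result
# A B C
# 3 2 3
```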

Answer 2

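What you mean by "within 180 days" is a bit ambiguous. Within 180 days of which date?

The code below determines, for each patient, the number of different doctors seen within 180 days after each visit.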

library(data.table)
setDT(df)[, date:=as.Date(date)]
df[, date.hi:=date+180]
result <- df[df, on=.(Patient, date>=date, date<=date.hi)]
result[, .(count=uniqueN(Doctor)), by=.(Patient, date)]
##     Patient       date count
##  1:       A 2020-01-01     3
##  2:       A 2020-03-01     3
##  3:       A 2020-04-30     3
##  4:       A 2020-06-29     2
##  5:       A 2020-08-28     1
##  6:       B 2020-01-01     2
##  7:       B 2020-03-01     2
##  8:       B 2020-04-30     2
##  9:       B 2020-06-29     2
## 10:       B 2020-08-28     1
## 11:       C 2020-04-30     3
## 12:       C 2020-06-29     3
## 13:       C 2020-08-28     3
## 14:       C 2020-10-27     2
## 15:       C 2020-12-26     1

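So Patient A saw 3 doctors within 180 days of 2020-01-01 (row 1), but only 2 doctors within 180 days of 2020-06-29 (row 4). Obviously, if the dataset ends less than 180 days after a given date, we cannot really know how many visits will occur within that time frame.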
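One way to make that caveat visible is to flag which 180-day windows are fully covered by the data. The following is only a base R sketch for Patient A; it assumes, purely for illustration, that observation ends at the patient's last recorded visit (a real study end date may differ), and the column names are hypothetical.

```r
# Patient A's visit dates from the example data
visit_dates <- as.Date(c("2020-01-01", "2020-03-01", "2020-04-30",
                         "2020-06-29", "2020-08-28"))

# Assumption: the data stop at the last recorded visit
last_observed <- max(visit_dates)

# A window is complete only if its end date falls within the observed period
window_end <- visit_dates + 180
window_complete <- window_end <= last_observed

data.frame(date = visit_dates, window_end, window_complete)
# Only the windows starting 2020-01-01 and 2020-03-01 are complete
```

Counts for windows flagged as incomplete are lower bounds rather than final values.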

The expected result in your question appears to be based on each patient's first visit. We can extract that as follows:

result[, .(count=uniqueN(Doctor)), by=.(Patient, date)][, .SD[1], by=.(Patient)]
##    Patient       date count
## 1:       A 2020-01-01     3
## 2:       B 2020-01-01     2
## 3:       C 2020-04-30     3

Edit: based on OP's comment, the maximum count for each patient is given by:

result[, .(count=uniqueN(Doctor)), by=.(Patient, date)][
       , .(maxCount=max(count)),   by=.(Patient)]
##    Patient maxCount
## 1:       A        3
## 2:       B        2
## 3:       C        3

Answer 3

Use the runner package for rolling window calculations like this. It's great.

library(tidyverse)
library(lubridate)
library(runner)


df <- tribble(
    ~Patient, ~date, ~Doctor,
    "A", "2020-01-01", "A",
    "A", "2020-03-01", "A",
    "A", "2020-04-30", "B",
    "A", "2020-06-29", "C",
    "A", "2020-08-28", "A",
    "B", "2020-01-01", "A",
    "B", "2020-03-01","B",
    "B", "2020-04-30","B",
    "B", "2020-06-29","B",
    "B", "2020-08-28","C",
    "C", "2020-04-30","A",
    "C", "2020-06-29","A",
    "C", "2020-08-28","B",
    "C", "2020-10-27","C",
    "C", "2020-12-26","A",
) %>% 
    mutate(date = ymd(date))

df %>% 
    group_by(Patient) %>% 
    mutate(num_docs = runner(Doctor, n_distinct, k = 180, idx = date)) %>% 
    summarize(num_docs = max(num_docs))

# A tibble: 3 × 2
  Patient num_docs
  <chr>      <int>
1 A              3
2 B              2
3 C              3
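Note that the answers do not necessarily use the same window boundaries. The data.table join in Answer 2 keeps visits with date <= date + 180, so both endpoints count and a visit exactly 180 days later is included. Assuming runner treats k = 180 with idx = date as the 180 calendar days ending on the current visit (worth confirming in the package documentation), a visit exactly 180 days earlier would fall outside the window. A minimal base R sketch of the two conventions, with hypothetical helper names that belong to neither package:

```r
# Two visits exactly 180 days apart, to two different doctors
dates   <- as.Date(c("2020-01-01", "2020-06-29"))
doctors <- c("A", "C")

# Forward window with both endpoints included: [d, d + width]
count_forward_inclusive <- function(d, width = 180) {
  length(unique(doctors[dates >= d & dates <= d + width]))
}

# Backward window of `width` calendar days ending on d: (d - width, d]
count_backward_days <- function(d, width = 180) {
  length(unique(doctors[dates > d - width & dates <= d]))
}

forward_count  <- count_forward_inclusive(dates[1])
backward_count <- count_backward_days(dates[2])

forward_count   # 2: the visit on day 180 is included
backward_count  # 1: the visit 180 days earlier is excluded
```

For the example data both conventions give the same per-patient maximum, but they can differ when two visits are exactly 180 days apart and no other visit to the same doctor falls in between.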

r

dplyr