R: filter one variable by the count of another variable, counting only within a maximum day interval
This is the data frame I am working with:
df <- tribble(
~Patient, ~date, ~Doctor,
"A", "2020-01-01", "A",
"A", "2020-03-01", "A",
"A", "2020-04-30", "B",
"A", "2020-06-29", "C",
"A", "2020-08-28", "A",
"B", "2020-01-01", "A",
"B", "2020-03-01","B",
"B", "2020-04-30","B",
"B", "2020-06-29","B",
"B", "2020-08-28","C",
"C", "2020-04-30","A",
"C", "2020-06-29","A",
"C", "2020-08-28","B",
"C", "2020-10-27","C",
"C", "2020-12-26","A",
)
As you can see, there are three columns: Patient, date, and Doctor.
This is the desired data frame I am trying to produce.
desired_df <- tribble(
~Patient, ~Number_of_Diff_Doctors_within_180_days,
"A", "3",
"B", "2",
"C", "3",
)
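The logic is as follows:
I am trying to return a data frame with one row per patient, giving the number of different doctors that patient saw within a 180-day window. The 180-day period works like a moving window: the task is to find the maximum number of doctors a patient saw during any 180-day window.
In the example, patient A saw 3 different doctors (A, B, and C) between 2020-03-01 and 2020-06-29, which is less than 180 days, so this patient gets a count of 3. Patient B also has three doctors overall, but saw doctor A on 2020-01-01 and doctor C on 2020-08-28, so no 180-day window contains more than two doctors. Patient C has the same visit intervals as patient A, except that the dates are shifted forward.
This is my attempt so far. It does nothing with the date logic, because I don't know how to approach that part.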
attempt <- df %>%
dplyr::select(Patient, Doctor) %>%
dplyr::group_by(Patient, Doctor) %>%
distinct() %>%
dplyr::group_by(Patient) %>%
tally() %>%
filter(n > 1)
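Solution updated based on OP's edit.
First, let's build a tidy data frame that includes the cumulative number of days across each patient's visits: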
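# Requires dplyr and tidyr (both attached by library(tidyverse)).
df2 <- df %>%
  mutate(date = as.Date(date)) %>%
  group_by(Patient) %>%
  # Days since the patient's previous visit; 0 for the first visit.
  mutate(days_btwn = replace_na(as.numeric(date - lag(date), units = "days"), 0),
         cum_days = cumsum(days_btwn)) %>%
  ungroup()
Sample of the df2 output: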
# A tibble: 15 × 5
Patient date Doctor days_btwn cum_days
<chr> <date> <chr> <dbl> <dbl>
1 A 2020-01-01 A 0 0
2 A 2020-03-01 A 60 60
3 A 2020-04-30 B 60 120
4 A 2020-06-29 C 60 180
5 A 2020-08-28 A 60 240
6 B 2020-01-01 A 0 0
#...
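Next, we can iterate over each Patient (essentially a group-by operation) and iteratively sample rolling windows of visits. For every window spanning at most 180 days in total, count the distinct Doctor values, keep the maximum, and combine the results for all patients into one data frame.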
unique(df2$Patient) %>%
map_dfr(function(pat) {
this_pat <- df2 %>% filter(Patient == pat)
n_obs <- nrow(this_pat)
max_docs <- n_distinct(this_pat$Doctor)
n_docs <- 0
max_win_docs <- 0
for (i in 1:n_obs) {
for (j in 1:n_obs) {
win_days <- abs(this_pat$cum_days[j] - this_pat$cum_days[i])
if (win_days <= 180) {
n_docs <- n_distinct(this_pat %>% slice(i:j) %>% select(Doctor))
if (n_docs > max_win_docs) max_win_docs <- n_docs
if (max_win_docs == max_docs) next
}
}
}
list(patient = pat, n_diff_docs_within_180 = max_win_docs)
}
)
Output:
# A tibble: 3 × 2
patient n_diff_docs_within_180
<chr> <int>
1 A 3
2 B 2
3 C 3
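As a cross-check on the nested loop above, here is a minimal base R sketch of the same idea. It assumes a window may start at any visit and covers that visit plus the following 180 days, inclusive. The helper name `max_docs_in_window` and its `window_days` argument are made up for illustration and are not part of the answer's code.

```r
# Rebuild the example data in base R so the sketch is self-contained.
df <- data.frame(
  Patient = rep(c("A", "B", "C"), each = 5),
  date = as.Date(c(
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-04-30", "2020-06-29", "2020-08-28", "2020-10-27", "2020-12-26"
  )),
  Doctor = c("A", "A", "B", "C", "A",
             "A", "B", "B", "B", "C",
             "A", "A", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: largest number of distinct doctors seen in any
# window that starts at a visit and spans `window_days` days (inclusive).
max_docs_in_window <- function(dates, doctors, window_days = 180) {
  counts <- vapply(seq_along(dates), function(i) {
    in_window <- dates >= dates[i] & dates <= dates[i] + window_days
    length(unique(doctors[in_window]))
  }, integer(1))
  max(counts)
}

# Apply the helper to each patient's visits.
res <- sapply(split(df, df$Patient), function(p) {
  max_docs_in_window(p$date, p$Doctor)
})
res
```

Because the maximum over all windows is always reached by some window that starts exactly on a visit date, checking only those start points should be enough for this data.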
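What you mean by "within 180 days" is a bit ambiguous. Within 180 days of which date?
The following computes, for each patient, the number of different doctors seen within the 180 days after each visit.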
library(data.table)
setDT(df)[, date:=as.Date(date)]
df[, date.hi:=date+180]
result <- df[df, on=.(Patient, date>=date, date<=date.hi)]
result[, .(count=uniqueN(Doctor)), by=.(Patient, date)]
## Patient date count
## 1: A 2020-01-01 3
## 2: A 2020-03-01 3
## 3: A 2020-04-30 3
## 4: A 2020-06-29 2
## 5: A 2020-08-28 1
## 6: B 2020-01-01 2
## 7: B 2020-03-01 2
## 8: B 2020-04-30 2
## 9: B 2020-06-29 2
## 10: B 2020-08-28 1
## 11: C 2020-04-30 3
## 12: C 2020-06-29 3
## 13: C 2020-08-28 3
## 14: C 2020-10-27 2
## 15: C 2020-12-26 1
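So patient A saw 3 doctors within 180 days of 2020-01-01 (row 1), but only 2 doctors within 180 days of 2020-06-29 (row 4). Obviously, if the data set ends less than 180 days after a given date, we cannot really know how many visits will occur in that time frame.
The expected result in your question appears to be based on each patient's first visit. We can extract that as follows: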
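To make the forward-looking window concrete, here is a small base R sketch that recomputes two of patient A's counts by hand. It assumes the window runs from the visit date through 180 days later, both ends inclusive, which mirrors the join condition used above. The function name `docs_after` is hypothetical.

```r
# Patient A's visits from the example data.
visits <- data.frame(
  date = as.Date(c("2020-01-01", "2020-03-01", "2020-04-30",
                   "2020-06-29", "2020-08-28")),
  Doctor = c("A", "A", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: distinct doctors seen from `start` through
# `start + 180` days (inclusive).
docs_after <- function(start) {
  in_window <- visits$date >= start & visits$date <= start + 180
  length(unique(visits$Doctor[in_window]))
}

docs_after(as.Date("2020-01-01"))  # first four visits fall in the window
docs_after(as.Date("2020-06-29"))  # only the last two visits remain
```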
result[, .(count=uniqueN(Doctor)), by=.(Patient, date)][, .SD[1], by=.(Patient)]
## Patient date count
## 1: A 2020-01-01 3
## 2: B 2020-01-01 2
## 3: C 2020-04-30 3
Edit: based on OP's comment.
The maximum count for each patient is given by
result[, .(count=uniqueN(Doctor)), by=.(Patient, date)][
, .(maxCount=max(count)), by=.(Patient)]
## Patient maxCount
## 1: A 3
## 2: B 2
## 3: C 3
Use the runner package for rolling-window calculations like this. It's great.
library(tidyverse)
library(lubridate)
library(runner)
df <- tribble(
~Patient, ~date, ~Doctor,
"A", "2020-01-01", "A",
"A", "2020-03-01", "A",
"A", "2020-04-30", "B",
"A", "2020-06-29", "C",
"A", "2020-08-28", "A",
"B", "2020-01-01", "A",
"B", "2020-03-01","B",
"B", "2020-04-30","B",
"B", "2020-06-29","B",
"B", "2020-08-28","C",
"C", "2020-04-30","A",
"C", "2020-06-29","A",
"C", "2020-08-28","B",
"C", "2020-10-27","C",
"C", "2020-12-26","A",
) %>%
mutate(date = ymd(date))
df %>%
group_by(Patient) %>%
mutate(num_docs = runner(Doctor, n_distinct, k = 180, idx = date)) %>%
summarize(num_docs = max(num_docs))
# A tibble: 3 × 2
Patient num_docs
<chr> <int>
1 A 3
2 B 2
3 C 3
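For intuition about what a trailing window computes, here is a base R sketch for patient B that does not use runner. It assumes the window at each visit covers the 180 calendar days ending on that visit date (from `date - 179` through `date`); that boundary is my assumption, so check the runner documentation for the exact behavior of `k` with a date `idx`.

```r
# Patient B's visits from the example data.
dates <- as.Date(c("2020-01-01", "2020-03-01", "2020-04-30",
                   "2020-06-29", "2020-08-28"))
doctors <- c("A", "B", "B", "B", "C")

# For each visit, count distinct doctors in the assumed trailing window
# [date - 179, date].
counts <- vapply(seq_along(dates), function(i) {
  in_window <- dates >= dates[i] - 179 & dates <= dates[i]
  length(unique(doctors[in_window]))
}, integer(1))

counts       # per-visit trailing counts
max(counts)  # the per-patient value kept by summarize()
```

Doctors A and C never fall in the same window here, since their visits are 240 days apart, which is why patient B ends up with 2.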