Mutate new column based on moving window of fixed date interval size, in R
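I have data in R for one patient, showing the dates on which they were tested for a certain condition and whether each test was positive. The data look like this: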
date positive
2005-02-22 yes
2005-04-26 no
2005-08-02 yes
2005-10-04 no
2005-12-06 yes
2006-03-14 no
2006-06-06 no
2006-09-12 yes
2006-12-19 yes
2007-03-27 yes
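Now I introduce a new definition. The patient's condition is defined as "chronic positive" if "the current test is positive, and >=50% of the tests in the preceding 365 days were positive". So I want to create an output dataset that tells me whether the patient was chronically positive at each date. For example, the output should look like this (e.g. on 2006-09-12 they are "positive" but not "chronic positive", because 3 of the 4 visits in the preceding 365 days were negative):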
date positive chronic
2005-02-22 yes no
2005-04-26 no no
2005-08-02 yes yes
2005-10-04 no no
2005-12-06 yes yes
2006-03-14 no no
2006-06-06 no no
2006-09-12 yes no
2006-12-19 yes no
2007-03-27 yes yes
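How can I do this? At each row of interest, I need to be able to look at the preceding rows (within the past 365 days) and assess what proportion of them were positive. I think I could use a combination of the lead/lag functions and dplyr, but I would appreciate an example of how to do this.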
The original data can be reproduced with:
dat <- structure(list(date = structure(c(12836, 12899, 12997, 13060, 13123, 13221, 13305, 13403, 13501, 13599), class = "Date"),
positive = c("yes", "no", "yes", "no", "yes", "no", "no", "yes", "yes", "yes")),
row.names = c(NA, 10L), class = "data.frame")
Here is one approach:
library(dplyr)
library(purrr)
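dat %>%
  mutate(chronic = map_chr(row_number(), ~{
    # rows dated within the 365 days before the current row (current day excluded)
    inds <- between(date, date[.x] - 365, date[.x] - 1)
    # chronic only if the current test is positive, the window is non-empty,
    # and at least half of the tests in the window are positive
    if (positive[.x] == "yes" && any(inds) && mean(positive[inds] == 'yes') >= 0.5) 'yes' else 'no'
  }))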
# date positive chronic
#1 2005-02-22 yes no
#2 2005-04-26 no no
#3 2005-08-02 yes yes
#4 2005-10-04 no no
#5 2005-12-06 yes yes
#6 2006-03-14 no no
#7 2006-06-06 no no
#8 2006-09-12 yes no
#9 2006-12-19 yes no
#10 2007-03-27 yes yes
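For readers who would rather avoid extra packages, the same window logic can be sketched in base R. This is only an illustration of the approach above, not part of the original answer: the helper name chronic_flag and its days argument are made up here, and the data frame is rebuilt from the dates shown in the question so the block runs on its own.

```r
# Rebuild the example data from the question
dat <- data.frame(
  date = as.Date(c("2005-02-22", "2005-04-26", "2005-08-02", "2005-10-04",
                   "2005-12-06", "2006-03-14", "2006-06-06", "2006-09-12",
                   "2006-12-19", "2007-03-27")),
  positive = c("yes", "no", "yes", "no", "yes", "no", "no", "yes", "yes", "yes"),
  stringsAsFactors = FALSE
)

# chronic_flag is a hypothetical helper name, not from any package
chronic_flag <- function(dates, positive, days = 365) {
  vapply(seq_along(dates), function(i) {
    # tests dated from `days` days back up to the day before the current test
    in_window <- dates >= dates[i] - days & dates < dates[i]
    # positive now, at least one earlier test, and >= 50% of those positive
    if (positive[i] == "yes" && any(in_window) &&
        mean(positive[in_window] == "yes") >= 0.5) "yes" else "no"
  }, character(1))
}

dat$chronic <- chronic_flag(dat$date, dat$positive)
```

The any(in_window) check is what turns the first visit, which has no earlier tests, into "no" rather than a missing value.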
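You can use the slider package for this kind of rolling computation. Explanation of the syntax:
- slide_index_lgl works on a vector .x together with an index .i and produces a logical vector as output.
- .x is the positive vector.
- .i is the date vector.
- .before and .after are self-explanatory (include the preceding 365 days, exclude the current day).
- .f is simple: it checks whether at least half of the tests in the preceding 365 days were positive. I used the formula (sum(.x == 'yes') / length(.x)) >= 0.5
- This output is combined with the other condition, i.e. positive == 'yes'
- 1 is added to this logical output, giving 1 for FALSE and 2 for TRUE.
- This complete output is used as an index into the vector c('No', 'Yes'), so that you get Yes for TRUE and No for FALSE.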
library(tidyverse)
df <- read.table(header = TRUE, text = 'date positive
2005-02-22 yes
2005-04-26 no
2005-08-02 yes
2005-10-04 no
2005-12-06 yes
2006-03-14 no
2006-06-06 no
2006-09-12 yes
2006-12-19 yes
2007-03-27 yes')
df$date <- as.Date(df$date)
library(slider)
library(lubridate)
df %>%
  mutate(chronic = c('No', "Yes")[1 + (positive == 'yes' & slide_index_lgl(positive, date,
                                          ~ (sum(.x == 'yes') / length(.x)) >= 0.5,
                                          .before = days(365),
                                          .after = days(-1)))])
#> date positive chronic
#> 1 2005-02-22 yes <NA>
#> 2 2005-04-26 no No
#> 3 2005-08-02 yes Yes
#> 4 2005-10-04 no No
#> 5 2005-12-06 yes Yes
#> 6 2006-03-14 no No
#> 7 2006-06-06 no No
#> 8 2006-09-12 yes No
#> 9 2006-12-19 yes No
#> 10 2007-03-27 yes Yes
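The <NA> in the first row of the output above appears to come from the empty window: the first visit has no earlier tests, so the proportion is 0/0. A minimal base R sketch of that arithmetic and of the 1/2 indexing trick follows. The vectors pos and win_ok are made-up inputs for illustration, not the patient data.

```r
# An empty window makes the proportion 0/0 = NaN, and NaN >= 0.5 is NA
empty <- character(0)
empty_result <- (sum(empty == "yes") / length(empty)) >= 0.5

# Hypothetical inputs: current test result and the window check per row
pos    <- c("yes", "no", "yes")
win_ok <- c(NA, TRUE, TRUE)   # NA stands in for the empty-window case

# Combine both conditions, shift FALSE/TRUE to 1/2, and index the labels;
# TRUE & NA stays NA, so the first label is missing
labels <- c("No", "Yes")[1 + (pos == "yes" & win_ok)]
```

If a "No" is wanted for the first visit instead, one option is to make the window function return FALSE when length(.x) is 0 before it computes the proportion.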
An alternative strategy in base R, using runner::runner():
dat <- structure(list(date = structure(c(12836, 12899, 12997, 13060, 13123, 13221, 13305, 13403, 13501, 13599), class = "Date"),
positive = c("yes", "no", "yes", "no", "yes", "no", "no", "yes", "yes", "yes")),
row.names = c(NA, 10L), class = "data.frame")
library(runner)
dat$chronic <- ifelse(runner(dat$positive, idx = dat$date, lag = '1 day',
                             k = '365 days',
                             f = \(.x) (sum(.x == 'yes')/length(.x)) >= 0.5) & dat$positive == 'yes', 'yes', 'no')
dat
#> date positive chronic
#> 1 2005-02-22 yes <NA>
#> 2 2005-04-26 no no
#> 3 2005-08-02 yes yes
#> 4 2005-10-04 no no
#> 5 2005-12-06 yes yes
#> 6 2006-03-14 no no
#> 7 2006-06-06 no no
#> 8 2006-09-12 yes no
#> 9 2006-12-19 yes no
#> 10 2007-03-27 yes yes
If you would rather not use a rolling function, you can also use this solution:
library(dplyr)
library(purrr)
library(lubridate)
map(df %>%
      filter(positive == "yes") %>%
      pull(date), ~ df %>% filter(date %within% interval(.x - days(365), .x))) %>%
  map_dfr(~ .x %>%
            summarise(date = last(date),
                      chronic = (sum(positive == "yes")-1)/ (n()-1) >= 0.5)) %>%
  right_join(df, by = "date") %>%
  arrange(date) %>%
  mutate(chronic = if_else(is.na(chronic) | !chronic, "no", "yes"))
# A tibble: 10 x 3
date chronic positive
<chr> <chr> <chr>
1 2005-02-22 no yes
2 2005-04-26 no no
3 2005-08-02 yes yes
4 2005-10-04 no no
5 2005-12-06 yes yes
6 2006-03-14 no no
7 2006-06-06 no no
8 2006-09-12 no yes
9 2006-12-19 no yes
10 2007-03-27 yes yes
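The "- 1" in both the numerator and the denominator is there because the interval includes the current date, which is always a positive test in this pipeline. The arithmetic can be sketched in base R for 2006-09-12; the window contents below are read off the data in the question, and the variable names are chosen only for illustration.

```r
# Tests dated 2005-09-12 through 2006-09-12 inclusive, in date order;
# the last element is the current (positive) test
window <- c("no", "yes", "no", "no", "yes")

# Drop the current test from both the positive count and the test count
prop_prior <- (sum(window == "yes") - 1) / (length(window) - 1)
chronic <- prop_prior >= 0.5   # 1 of 4 earlier tests positive, so FALSE

# First visit: the window holds only the current test, so the ratio is 0/0
first <- c("yes")
first_flag <- (sum(first == "yes") - 1) / (length(first) - 1) >= 0.5
```

The missing value produced for the first visit is what the is.na(chronic) branch in the final mutate() converts to "no".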
Another option, using a non-equi join in data.table:
library(data.table)
setDT(dat)[, yrago := date - 365L]
dat[, chronic := fifelse(
    .SD[.SD, on=.(date>=yrago, date<date),
        by=.EACHI, .N>0 & i.positive=="yes" & sum(x.positive=="yes")/.N >= 0.5]$V1,
    "yes", "no")
]
dat[, yrago := NULL][]
Output:
date positive chronic
1: 2005-02-22 yes no
2: 2005-04-26 no no
3: 2005-08-02 yes yes
4: 2005-10-04 no no
5: 2005-12-06 yes yes
6: 2006-03-14 no no
7: 2006-06-06 no no
8: 2006-09-12 yes no
9: 2006-12-19 yes no
10: 2007-03-27 yes yes