Creating a new variable while looking back at previous rows
I have a simple dataset of patient visits:
date infection
2005-01-01 yes
2005-06-30 yes
2005-10-15 yes
2006-01-01 no
2006-06-01 no
2006-11-01 yes
2006-12-01 no
2007-11-15 yes
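In R, I want to add a column named chronic whose values are yes, no, or NA.
- It should be yes only if infection == 'yes' on the current date and there are two infection == 'yes' rows in the previous 365 days.
- Otherwise, if there were not two visits in the previous 365 days, it should be NA.
- Otherwise it should be no.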
So the final dataset would look like this:
date infection chronic
2005-01-01 yes NA
2005-06-30 yes NA
2005-10-15 yes yes
2006-01-01 no no
2006-06-01 no no
2006-11-01 yes no
2006-12-01 no no
2007-11-15 yes NA
How can I code this? Ideally I'd like to use dplyr, but I'm open to any solution. Thanks!
The dataset can be recreated with this code:
dat <- data.frame(date = c(as.Date("2005-01-01"), as.Date("2005-06-30"), as.Date("2005-10-15"), as.Date("2006-01-01"), as.Date("2006-06-01"), as.Date("2006-11-01"), as.Date("2006-12-01"), as.Date("2007-11-15")), infection = c("yes", "yes", "yes", "no", "no", "yes", "no", "yes"))
You can try the map functions from purrr:
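library(dplyr)
library(purrr)

dat %>%
  mutate(chronic = map2_chr(date, infection,
    # .x is the current date, .y the current infection status; the window
    # covers the 365 days before the visit and excludes the visit itself
    ~case_when(
      .y == 'yes' &
        sum(infection[between(date, .x - 365, .x - 1)] == 'yes') >= 2 ~ 'yes',
      .y == 'yes' &
        sum(infection[between(date, .x - 365, .x - 1)] == 'yes') != 2 ~ NA_character_,
      TRUE ~ 'no')))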
# date infection chronic
#1 2005-01-01 yes <NA>
#2 2005-06-30 yes <NA>
#3 2005-10-15 yes yes
#4 2006-01-01 no no
#5 2006-06-01 no no
#6 2006-11-01 yes <NA>
#7 2006-12-01 no no
#8 2007-11-15 yes <NA>
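Note that this marks row 6 (2006-11-01) as NA, whereas the requested table has no there: the second condition counts prior infections rather than prior visits. Below is a minimal sketch that counts the two separately, assuming the window is the 365 days strictly before each visit. The helper count_prev and the columns n_prev_visits and n_prev_infect are names made up for this illustration, not part of dplyr.

```r
library(dplyr)

dat <- data.frame(
  date = as.Date(c("2005-01-01", "2005-06-30", "2005-10-15", "2006-01-01",
                   "2006-06-01", "2006-11-01", "2006-12-01", "2007-11-15")),
  infection = c("yes", "yes", "yes", "no", "no", "yes", "no", "yes")
)

# Hypothetical helper: for each row, count the rows where `flag` is TRUE and
# whose date falls in the `window` days strictly before that row's date.
count_prev <- function(date, flag, window = 365) {
  vapply(seq_along(date), function(i) {
    in_window <- date >= date[i] - window & date < date[i]
    sum(in_window & flag)
  }, integer(1))
}

res <- dat %>%
  mutate(
    n_prev_visits = count_prev(date, TRUE),               # any earlier visit
    n_prev_infect = count_prev(date, infection == "yes"), # earlier infections
    chronic = case_when(
      infection == "yes" & n_prev_infect >= 2 ~ "yes",    # rule 1
      n_prev_visits < 2 ~ NA_character_,                  # rule 2 (assumed: counts visits)
      TRUE ~ "no"                                         # rule 3
    )
  )

res
```

On the sample data this should give chronic = NA, NA, yes, no, no, no, no, NA, which is the column requested in the question. Whether the window should include the day exactly 365 days earlier is an assumption here; adjust the comparison in count_prev if not.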
An alternative using data.table and a range-based (non-equi) merge/join.
library(data.table)
library(magrittr) # not required, just used to show the flow
dat <- fread(text = "
date infection
2005-01-01 yes
2005-06-30 yes
2005-10-15 yes
2006-01-01 no
2006-06-01 no
2006-11-01 yes
2006-12-01 no
2007-11-15 yes")[, date := as.Date(date)]
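Code:
copy(dat) %>%
  # window bounds for each visit: the 365 days up to and including the visit
  .[, c("date0", "date1") := .(date - 365, date)] %>%
  # non-equi join: pair every visit with all visits inside its window
  dat[., on = .(date >= date0, date <= date1) ] %>%
  # per visit: its own status (last row in the window) plus window counts,
  # both of which include the current visit
  .[, .(infection = last(infection), n_visits = .N,
        n_infect = sum(infection == "yes")), by = .(i.date)] %>%
  setnames(., "i.date", "date") %>%
  .[, chronic := fcase(
    n_visits < 3, NA_character_,   # fewer than two earlier visits
    infection == "yes" & n_infect >= 2, "yes",
    rep(TRUE, .N), "no") ] %>%
  .[]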
# date infection n_visits n_infect chronic
# <Date> <char> <int> <int> <char>
# 1: 2005-01-01 yes 1 1 <NA>
# 2: 2005-06-30 yes 2 2 <NA>
# 3: 2005-10-15 yes 3 3 yes
# 4: 2006-01-01 no 4 3 no
# 5: 2006-06-01 no 4 2 no
# 6: 2006-11-01 yes 3 1 no
# 7: 2006-12-01 no 4 1 no
# 8: 2007-11-15 yes 2 1 <NA>