Lead/lag in R, but only for rows that meet a yes/no condition

I have a dataset of patient visits that looks like this:

   visit infection treatment
1      1  negative         1
2      2  negative         1
3      3  positive         1
4      4  negative         0
5      5  positive         1
6      6  positive         0
7      7  positive         1
8      8  negative         0
9      9  negative         1
10    10  negative         1
11    11  negative         0
12    12  positive         1
13    13  positive         1

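I want to create a column, treatment_second_neg_visit, that tells me whether the patient received treatment at the second infection-negative visit following the visit at hand (and NA if the visit at hand is not followed by two infection-negative visits). Basically lead/lag, but only counting rows that meet a certain condition.

Note: even for rows with a positive infection, I am still interested in the second infection-negative visit that follows.

Example 1: For the first visit (row 1), the next negative visit is row 2 and the second negative visit is row 4, where treatment=0. So the value of treatment_second_neg_visit for row 1 should be 0.

Example 2: For the second visit (row 2), the next negative visit is row 4 and the second negative visit is row 8, where treatment=0. So the value of treatment_second_neg_visit for row 2 should be 0.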

The final output should be:

visit    infection  treatment treatment_second_neg_visit
    1     negative          1                          0
    2     negative          1                          0
    3     positive          1                          0
    4     negative          0                          1
    5     positive          1                          1
    6     positive          0                          1
    7     positive          1                          1
    8     negative          0                          1
    9     negative          1                          0
    10    negative          1                          NA
    11    negative          0                          NA
    12    positive          1                          NA
    13    positive          1                          NA

Code to create the dataset:

dat <- data.frame(visit = 1:13, infection = c("negative", "negative", "positive", "negative", "positive", "positive", "positive", "negative", "negative", "negative", "negative", "positive", "positive"), treatment = c(1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1))

Base R or dplyr would be ideal, but I am open to any correct solution.

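1) sqldf First create a column neg giving the number of negative visits so far (including the current row), then perform a left self-join on the condition shown.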

library(sqldf)

dat2 <- transform(dat, neg = cumsum(infection == 'negative'))

sqldf("select a.visit, a.infection, a.treatment, b.treatment second
  from dat2 a
  left join dat2 b on a.neg + 2 = b.neg and b.infection = 'negative' ")

giving:

   visit infection treatment second
1      1  negative         1      0
2      2  negative         1      0
3      3  positive         1      0
4      4  negative         0      1
5      5  positive         1      1
6      6  positive         0      1
7      7  positive         1      1
8      8  negative         0      1
9      9  negative         1      0
10    10  negative         1     NA
11    11  negative         0     NA
12    12  positive         1     NA
13    13  positive         1     NA

Alternatively, we can do it all in a single SQL statement:

sqldf("with dat2 as (
  select *, sum(infection = 'negative') over (rows unbounded preceding) neg
  from dat
)
select a.visit, a.infection, a.treatment, b.treatment second
  from dat2 a
  left join dat2 b on a.neg + 2 = b.neg and b.infection = 'negative' ")
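To see why the condition a.neg + 2 = b.neg picks the right row, it may help to look at neg directly. The sketch below rebuilds the question's data and prints neg next to a lookup key; the column name key is made up here for illustration and is not part of the answer's code.

```r
# Rebuild the question's data so this sketch runs on its own.
dat <- data.frame(
  visit = 1:13,
  infection = c("negative", "negative", "positive", "negative", "positive",
                "positive", "positive", "negative", "negative", "negative",
                "negative", "positive", "positive"),
  treatment = c(1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1)
)

# neg = number of negative visits up to and including each row.
dat2 <- transform(dat, neg = cumsum(infection == "negative"))

# key (hypothetical name) = the neg value carried by the second
# negative visit that follows each row.
dat2$key <- dat2$neg + 2

print(dat2[, c("visit", "infection", "neg", "key")])
```

Because neg increases by one exactly at each negative visit, each negative row has a unique neg value. For any row, whether negative or positive, the next negative visit has neg + 1 and the one after that has neg + 2, which is the row the join matches. Rows whose key exceeds the total number of negative visits find no match and so get NA.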

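2) dplyr dat2 is dat with an additional column giving the number of negative visits up to and including the current row. We then perform the same left join.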

library(dplyr)

dat2 <- dat %>%
  mutate(neg = cumsum(infection == 'negative'))
   
dat2 %>%
  mutate(neg = neg + 2) %>% 
  left_join(filter(dat2, infection == 'negative'), by = "neg", suffix = c("", ".y")) %>%
  select(visit, infection, treatment, second = treatment.y)
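Since the question also mentions base R, the same idea can be sketched without any packages. This is only a sketch, assuming the rows are already sorted by visit and belong to a single patient; the helper names is_neg, neg and neg_treatment are made up here for illustration.

```r
# Rebuild the question's data so this sketch runs on its own.
dat <- data.frame(
  visit = 1:13,
  infection = c("negative", "negative", "positive", "negative", "positive",
                "positive", "positive", "negative", "negative", "negative",
                "negative", "positive", "positive"),
  treatment = c(1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1)
)

# Assumption: rows are in visit order for one patient.
is_neg <- dat$infection == "negative"

# Number of negative visits up to and including each row.
neg <- cumsum(is_neg)

# Treatment at the 1st, 2nd, ... negative visit, in order.
neg_treatment <- dat$treatment[is_neg]

# The second negative visit after a row is negative visit number neg + 2.
# Indexing past the end of neg_treatment returns NA, which covers the
# rows that are not followed by two negative visits.
dat$treatment_second_neg_visit <- neg_treatment[neg + 2]

print(dat)
```

With several patients in one data frame, the same computation would presumably need to be applied within each patient, for example via ave() or a grouped mutate(); that case is not covered by the question's data.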