Group by id and drug (with dates within 100 days of each other) and take the earliest and latest date

This is my dataset:

mydata = data.frame (Id =c(1,1,1,1,1,1,1,1,1,1),
                     Date = c("2000-01-01","2000-01-05","2000-02-02", "2000-02-12", 
                             "2000-02-14","2000-05-13", "2000-05-15", "2000-05-17", 
                              "2000-05-16", "2000-05-20"),
                     drug = c("A","A","B","B","B","A","A","A","C","C"))

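The code below gives me the difference between consecutive dosing dates, grouped by Id and drug. As you can see, for drug A there is a gap of >100 days between dosing dates.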

library(dplyr)

mydata <- mydata %>% group_by(Id, drug) %>% mutate(Diff = difftime(Date, lag(Date), units = 'days'))

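The task is to group by id and drug and get the earliest and latest dosing date for each drug. However, if there is a gap of >100 days between dates for the same drug, each run of doses needs its own row with its earliest and latest date.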

The code below gets me the earliest and latest dates, but I am not sure how to add the 100-day gap condition to it.

mydata %>% group_by(Id, drug) %>% 
  summarise(startDate = min(as.Date(Date), na.rm = TRUE),
            endDate = max(as.Date(Date), na.rm = TRUE))

Below is the output I am hoping for:

mydata1 = data.frame (Id =c(1,1,1,1),
                     startDate = c("2000-01-01","2000-02-02","2000-05-13", "2000-05-16"),
                     endDate = c("2000-01-05", "2000-02-14", "2000-05-17", "2000-05-20"),
                     drug = c("A","B","A","C"))

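As you can see, drug A has two rows: one with the first start and end date, and a second with the start and end date of the doses given after the gap of more than 100 days.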

Any help would be greatly appreciated! Thanks

You can use cumsum to create a new grouping variable:

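library(dplyr)

mydata %>% 
  group_by(Id, drug) %>% 
  # days since the previous dose of the same drug (NA for the first dose)
  mutate(Diff = difftime(as.Date(Date), lag(as.Date(Date)), units = 'days')) %>%  
  # each gap of more than 100 days increments grp, which starts a new run
  group_by(Id, drug, grp = cumsum(coalesce(Diff, as.difftime(0, units = 'days')) > 100)) %>% 
  summarise(startDate = min(as.Date(Date), na.rm = TRUE),
            endDate = max(as.Date(Date), na.rm = TRUE),
            .groups = "drop") %>% 
  select(-grp)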

This returns

# A tibble: 4 x 4
     Id drug  startDate  endDate   
  <dbl> <chr> <date>     <date>    
1     1 A     2000-01-01 2000-01-05
2     1 A     2000-05-13 2000-05-17
3     1 B     2000-02-02 2000-02-14
4     1 C     2000-05-16 2000-05-20
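
To see why the cumsum step splits drug A into two rows, here is a minimal base R sketch of the same idea applied to the drug A dates alone. The names dates, gap, grp, runs and result are only illustrative and do not come from the code above, and the sketch assumes the dates are already sorted in ascending order.

```r
# Drug A dates from the question, converted to Date (assumed sorted)
dates <- as.Date(c("2000-01-01", "2000-01-05",
                   "2000-05-13", "2000-05-15", "2000-05-17"))

# Days since the previous dose; the first dose has no predecessor, so use 0
gap <- c(0, as.numeric(diff(dates)))

# TRUE marks a gap of more than 100 days; cumsum turns each TRUE into a new run id
grp <- cumsum(gap > 100)

# Split the dates by run id and take the earliest and latest date of each run
runs <- unname(split(dates, grp))
result <- data.frame(startDate = do.call(c, lapply(runs, min)),
                     endDate   = do.call(c, lapply(runs, max)))
print(result)
```

Here gap is 0, 4, 129, 2, 2, so only the third dose exceeds the threshold and grp becomes 0, 0, 1, 1, 1. If the dates might not be in order within an Id and drug, sorting them first (for example with arrange(Id, drug, Date) in the dplyr version) would probably be needed for the lag-based difference to be meaningful.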