For each ID, return the earliest date from the start column and the latest date from the end column in R

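I have a dataset in which each ID has multiple start dates and end dates. I want to get the earliest date from the startDate column and the latest date from the endDate column.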


data = data.frame(ID=c(1,1,1,1,2,2,2),
                  startDate= c("2018-01-31", "2018-01-31", "2018-01-31", "2019-06-06",
                          "2002-06-07", "2002-06-07", "2002-09-12"),
                  endDate = c(NA,NA,NA,"2019-07-09",NA,NA, "2002-10-02"))

This is the output I would like to get:

data = data.frame(ID=c(1,2),
                  startDate= c("2018-01-31","2002-06-07"),
                  endDate = c("2019-07-09","2002-10-02"))

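After some experimenting, I worked out how to do this with the code below, but I would prefer a more efficient approach if one exists. I need to do this all the time, and I would rather not have to create two separate data frames. Thanks for your help!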

data_start <- data %>%
  group_by(ID) %>%
  arrange(startDate) %>%
  slice(1L)

data_end <- data %>%
  group_by(ID) %>%
  arrange(desc(endDate)) %>%
  slice(1L)

data <- left_join(data_start[,c(1,2)], data_end[,c(1,3)], by="ID")

You can use min and max after converting the variables to dates:

library(dplyr)
data %>%
  group_by(ID) %>%
  summarise(startDate = min(as.Date(startDate), na.rm = TRUE),
            endDate = max(as.Date(endDate), na.rm = TRUE))
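
One caveat with the min/max approach: if some ID has no non-missing end date at all, max(..., na.rm = TRUE) returns -Inf and emits a warning. The sketch below is one possible way to guard against that. The helper name latest_or_na is hypothetical (it is not part of dplyr), and the sample data is the one from the question.

```r
library(dplyr)

# Sample data from the question
data <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2),
  startDate = c("2018-01-31", "2018-01-31", "2018-01-31", "2019-06-06",
                "2002-06-07", "2002-06-07", "2002-09-12"),
  endDate = c(NA, NA, NA, "2019-07-09", NA, NA, "2002-10-02")
)

# Hypothetical helper: latest non-missing date, or NA when every value is missing
latest_or_na <- function(x) {
  if (all(is.na(x))) as.Date(NA) else max(x, na.rm = TRUE)
}

result <- data %>%
  # Convert once up front so both summaries work on Date values
  mutate(startDate = as.Date(startDate), endDate = as.Date(endDate)) %>%
  group_by(ID) %>%
  summarise(
    startDate = min(startDate, na.rm = TRUE),
    endDate = latest_or_na(endDate)
  )

print(result)
```

On the sample data every ID has at least one end date, so the guard never triggers; it only matters for IDs whose endDate values are all NA.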

Or, if the rows within each ID are already in date order, first() and last():

library(dplyr)
data %>% 
  group_by(ID) %>%
  summarise(
    startDate = first(startDate),
    endDate = last(endDate)
  )
# A tibble: 2 x 3
     ID startDate  endDate   
* <dbl> <chr>      <chr>     
1     1 2018-01-31 2019-07-09
2     2 2002-06-07 2002-10-02
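
Note that first() and last() pick values by row position, not by date, so the result above depends on how the rows happen to be ordered. As a rough sketch of an order-independent variant, the values can be sorted inside summarise() first. This assumes the dates are ISO-formatted strings (YYYY-MM-DD), which sort chronologically as text, and it relies on sort() dropping NA values by default. The reversed copy of the data is only there to show that row order no longer matters.

```r
library(dplyr)

# Sample data from the question
data <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2),
  startDate = c("2018-01-31", "2018-01-31", "2018-01-31", "2019-06-06",
                "2002-06-07", "2002-06-07", "2002-09-12"),
  endDate = c(NA, NA, NA, "2019-07-09", NA, NA, "2002-10-02")
)

# Reverse the rows so that plain first()/last() would pick the wrong values
reversed <- data[nrow(data):1, ]

result_sorted <- reversed %>%
  group_by(ID) %>%
  summarise(
    # sort() removes NA by default, then first()/last() take the extremes
    startDate = first(sort(startDate)),
    endDate = last(sort(endDate))
  )

print(result_sorted)
```

For date columns stored in other formats, converting with as.Date() before sorting would be the safer choice.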