Time difference calculated from wide data with missing rows

Question

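I have a longitudinal dataset in wide format, and I want to compute the time (in years and in days) between the first and the last observation date. Dates are formatted as yyyy-mm-dd. The dataset has four observation periods, some of which have missing dates. An example follows: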

df1<-data.frame("id"=c(1:4),
           "adate"=c("2011-06-18","2011-06-18","2011-04-09","2011-05-20"),
           "bdate"=c("2012-06-15","2012-06-15",NA,"2012-05-23"),
           "cdate"=c("2013-06-18","2013-06-18","2013-04-09",NA),
           "ddate"=c("2014-06-15",NA,"2014-04-11",NA))

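Here "adate" is the first date, and the last date is the date on which a person was last seen. To compute the time difference (lastdate - adate), I tried the "lubridate" package, for example: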

lubridate::time_length(difftime(as.Date("2012-05-23"), as.Date("2011-05-20")),"years")

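However, the last date does not always come from the same column, which is what confuses me. I am looking for a way to compute this automatically in R. The expected output looks like this: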

  id years days
1  1  2.99 1093
2  2  2.00  731
3  3  3.01 1098
4  4  1.01  369

Years are rounded to two decimal places.

Answer 1

We can use pmap from purrr to work through the data row by row:

library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
    mutate(out = pmap(.[-1], ~ {
      dates <- as.Date(na.omit(c(...)))
      tibble(years = lubridate::time_length(difftime(last(dates), 
            first(dates)), "years"), 
       days = lubridate::time_length(difftime(last(dates), first(dates)), "days"))
           })) %>% 
   unnest_wider(out)
# A tibble: 4 x 7
#     id adate      bdate      cdate      ddate      years  days
#  <int> <chr>      <chr>      <chr>      <chr>      <dbl> <dbl>
#1     1 2011-06-18 2012-06-15 2013-06-18 2014-06-15  2.99  1093
#2     2 2011-06-18 2012-06-15 2013-06-18 <NA>        2.00   731
#3     3 2011-04-09 <NA>       2013-04-09 2014-04-11  3.01  1098
#4     4 2011-05-20 2012-05-23 <NA>       <NA>        1.01   369
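
To make the row-wise logic concrete, here is a minimal base R sketch of roughly what the lambda does for a single row (id 4 in the example data). The names row_values, first_date and last_date are illustrative only and do not appear in the answer. The sketch assumes the non-missing dates appear in chronological column order, and it divides by 365.25 days per year, which is an assumption rather than something the answer states.

```r
# One row of the date columns (id 4), written out by hand for illustration
row_values <- c(adate = "2011-05-20", bdate = "2012-05-23", cdate = NA, ddate = NA)

# Drop the missing entries, then parse what is left as dates
dates <- as.Date(row_values[!is.na(row_values)])

# First and last remaining dates (assumes chronological column order)
first_date <- dates[1]
last_date  <- dates[length(dates)]

# Difference in days, then an approximate year count
days  <- as.numeric(difftime(last_date, first_date, units = "days"))
years <- round(days / 365.25, 2)  # 365.25 days per year is an assumption

print(c(days = days, years = years))
```

pmap repeats this for every row, and unnest_wider spreads the resulting years and days into their own columns.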

Answer 2

Another tidyverse solution is to convert the data to long format, drop the NA dates, and take the difference between the last and the first date for each id.

library(dplyr)
library(tidyr)
library(lubridate)

df1 %>% 
  pivot_longer(-id) %>% 
  na.omit %>% 
  group_by(id) %>% 
  mutate(value = as.Date(value)) %>% 
  summarise(years = time_length(difftime(last(value), first(value)),"years"),
            days = as.numeric(difftime(last(value), first(value))))

#> # A tibble: 4 x 3
#>      id years  days
#>   <int> <dbl> <dbl>
#> 1     1  2.99  1093
#> 2     2  2.00   731
#> 3     3  3.01  1098
#> 4     4  1.01   369
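
Note that first() and last() depend on the rows staying in column order within each id. As a hedged sketch of the same long-format idea without that dependency, the base R code below takes the span between the earliest and latest date per id using range(). The object names wide, long and result are made up for this illustration, and the 365.25 divisor is an assumption.

```r
# Copy of the example data from the question (named `wide` here for illustration)
wide <- data.frame(
  id    = 1:4,
  adate = c("2011-06-18", "2011-06-18", "2011-04-09", "2011-05-20"),
  bdate = c("2012-06-15", "2012-06-15", NA, "2012-05-23"),
  cdate = c("2013-06-18", "2013-06-18", "2013-04-09", NA),
  ddate = c("2014-06-15", NA, "2014-04-11", NA),
  stringsAsFactors = FALSE
)

# Long format: one row per (id, date column), similar in spirit to pivot_longer(-id)
long <- data.frame(
  id    = rep(wide$id, times = ncol(wide) - 1),
  value = as.Date(unlist(wide[-1], use.names = FALSE))
)

# Drop the missing dates
long <- long[!is.na(long$value), ]

# Days between the earliest and latest date per id; range() ignores row order
days <- sapply(split(long$value, long$id),
               function(d) as.numeric(diff(range(d))))

result <- data.frame(
  id    = as.integer(names(days)),
  days  = unname(days),
  years = round(unname(days) / 365.25, 2)  # 365.25 days per year is an assumption
)
print(result)
```

Using the minimum and maximum gives the same answer as first and last whenever the dates increase from adate to ddate, as they do in the example.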

Answer 3

Most of the functions shown here can be fairly complex. You should try to learn them if you can. In the meantime, here is a Base R approach:

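df <- df1
df[-1] <- lapply(df[-1], as.Date) # Convert the date columns from character to Date

grp <- droplevels(interaction(df[,1],row(df[-1]))) # Create a grouping: one level per row

days <- tapply(unlist(df[-1]),grp, function(x)max(x,na.rm = TRUE) - x[1]) # Get the difference in days

cbind(df[1],days, years = round(days/365,2)) # Create your table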

    id days years
1.1  1 1093  2.99
2.2  2  731  2.00
3.3  3 1098  3.01
4.4  4  369  1.01
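
The grouping step is the least obvious part, so here is a small sketch of why it works. It assumes the date columns are stored as Date and that each id occupies exactly one row. unlist() flattens the columns one after another and drops the Date class, leaving plain day counts. Repeating the ids once per column then labels each value with its row. The simpler rep() grouping used below is a stand-in for the interaction() call, not the answer's own code.

```r
# Example data with the date columns already stored as Date (an assumption)
df <- data.frame(
  id    = 1:4,
  adate = as.Date(c("2011-06-18", "2011-06-18", "2011-04-09", "2011-05-20")),
  bdate = as.Date(c("2012-06-15", "2012-06-15", NA, "2012-05-23")),
  cdate = as.Date(c("2013-06-18", "2013-06-18", "2013-04-09", NA)),
  ddate = as.Date(c("2014-06-15", NA, "2014-04-11", NA))
)

# Flatten column by column; the result is numeric (days since 1970-01-01)
flat <- unlist(df[-1], use.names = FALSE)

# One group label per flattened value: the ids repeated once per date column
grp <- rep(df$id, times = ncol(df) - 1)

# Within each group the first value is adate, so max minus first is the span in days
days <- tapply(flat, grp, function(x) max(x, na.rm = TRUE) - x[1])
print(days)
```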

If you are comfortable with other, more advanced functions, you can do this instead:

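# df1 stores the dates as character strings, so convert to Date inside the function
dat <- aggregate(adate~id,reshape(df1,list(2:ncol(df1)), direction="long"),
                 function(x)as.numeric(max(as.Date(x)) - as.Date(x[1])))
transform(dat,year = round(adate/365,2))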
  id adate  year
1  1 1093  2.99 
2  2  731  2.00 
3  3 1098  3.01 
4  4  369  1.01

Answer 4

Using base R apply:

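df1[-1] <- lapply(df1[-1], as.Date)

df1[c('years', 'days')] <- t(apply(df1[-1], 1, function(x) {
      # apply() coerces each row to character, so convert back to Date
      x <- as.Date(na.omit(x))
      # name the units argument: the third positional argument of difftime() is tz
      x1 <- as.numeric(difftime(x[length(x)], x[1], units = 'days'))
      c(x1/365, x1)
}))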

df1[c('id', 'years', 'days')]
#  id    years days
#1  1 2.994521 1093
#2  2 2.002740  731
#3  3 3.008219 1098
#4  4 1.010959  369
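
The years here differ slightly from the lubridate-based answers before rounding, because this answer divides by 365. As far as I know, recent lubridate versions treat a year as 365.25 days when converting a difftime, but take that as an assumption rather than a guarantee. The sketch below compares the two divisors on the day counts from the example. The names years_365 and years_365_25 are illustrative.

```r
# Day counts from the expected output in the question
days <- c(1093, 731, 1098, 369)

# Two common conventions for converting days to years
years_365    <- days / 365     # divisor used in the base R answers
years_365_25 <- days / 365.25  # assumed divisor behind lubridate::time_length

comparison <- data.frame(
  days         = days,
  years_365    = round(years_365, 4),
  years_365_25 = round(years_365_25, 4)
)
print(comparison)

# For this data both conventions agree once rounded to two decimals
same_after_rounding <- all(round(years_365, 2) == round(years_365_25, 2))
print(same_after_rounding)
```

With longer follow-up periods the two conventions can diverge at the second decimal, so it is worth choosing one deliberately.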

r

date

dataframe

lubridate

difftime