Fill dates by groups in a data frame in R

I want to go from this:

name code date usage result
Jennifer Aniston 23211 2021-11-04 345 1
Jennifer Aniston 23211 2021-11-05 260 1
Jennifer Aniston 23211 2021-11-06 230 0
Jennifer Aniston 23211 2021-11-07 0 0
Matthew Perry 44215 2022-10-01 312 1
Matthew Perry 44215 2022-10-04 230 0
Matthew Perry 44215 2021-10-05 232 0
Lisa Kudrow 55120 2022-01-01 132 0
Lisa Kudrow 55120 2022-01-02 125 0
Lisa Kudrow 55120 2022-01-04 345 1
Lisa Kudrow 55120 2022-01-06 321 1
Lisa Kudrow 55120 2022-01-07 431 1

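(Note: for Jennifer Aniston we have every date from the minimum to the maximum date present. For Matthew Perry, dates are missing: his data starts at 2022-10-01, but October 2 and 3 are absent. The same goes for Lisa Kudrow: her data starts on January 1, but January 3 and 5 are missing.)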

To this:

name code date usage result
Jennifer Aniston 23211 2021-11-04 345 1
Jennifer Aniston 23211 2021-11-05 260 1
Jennifer Aniston 23211 2021-11-06 230 0
Jennifer Aniston 23211 2021-11-07 0 0
Matthew Perry 44215 2022-10-01 312 1
Matthew Perry 44215 2022-10-02 NA NA
Matthew Perry 44215 2022-10-03 NA NA
Matthew Perry 44215 2022-10-04 230 0
Matthew Perry 44215 2021-10-05 232 0
Lisa Kudrow 55120 2022-01-01 132 0
Lisa Kudrow 55120 2022-01-02 125 0
Lisa Kudrow 55120 2022-01-03 NA NA
Lisa Kudrow 55120 2022-01-04 345 1
Lisa Kudrow 55120 2022-01-05 NA NA
Lisa Kudrow 55120 2022-01-06 321 1
Lisa Kudrow 55120 2022-01-07 431 1

So now we have all the dates, with NA wherever no data is available, and with the person's name and code filled in.

Any idea how to achieve this in R? (Preferably using dplyr and pipes.)

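Assuming your dates are actually in Date format rather than character format (we can't tell from the table in the question), and assuming that the year in row 7 is a typo (2021 where 2022 was intended), you could do this: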

library(tidyverse)

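df %>% 
  split(.$name) %>%   # one data frame per person
  lapply(function(x) {
    # add a row for every day between this person's first and last date;
    # name and code are filled from the group, the other columns become NA
    complete(x, expand(x, date = seq(min(x$date), max(x$date), by = 'day')),
             fill = list(name = x$name[1], code = x$code[1]))}) %>%
  bind_rows()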
#> # A tibble: 16 x 5
#>    date       name              code usage result
#>    <date>     <chr>            <int> <int>  <int>
#>  1 2021-11-04 Jennifer Aniston 23211   345      1
#>  2 2021-11-05 Jennifer Aniston 23211   260      1
#>  3 2021-11-06 Jennifer Aniston 23211   230      0
#>  4 2021-11-07 Jennifer Aniston 23211     0      0
#>  5 2022-01-01 Lisa Kudrow      55120   132      0
#>  6 2022-01-02 Lisa Kudrow      55120   125      0
#>  7 2022-01-03 Lisa Kudrow      55120    NA     NA
#>  8 2022-01-04 Lisa Kudrow      55120   345      1
#>  9 2022-01-05 Lisa Kudrow      55120    NA     NA
#> 10 2022-01-06 Lisa Kudrow      55120   321      1
#> 11 2022-01-07 Lisa Kudrow      55120   431      1
#> 12 2022-10-01 Matthew Perry    44215   312      1
#> 13 2022-10-02 Matthew Perry    44215    NA     NA
#> 14 2022-10-03 Matthew Perry    44215    NA     NA
#> 15 2022-10-04 Matthew Perry    44215   230      0
#> 16 2022-10-05 Matthew Perry    44215   232      0

Created on 2022-06-02 by the reprex package (v2.0.1)

Data taken from the question, in reproducible format:

df <- structure(list(name = c("Jennifer Aniston", "Jennifer Aniston", 
"Jennifer Aniston", "Jennifer Aniston", "Matthew Perry", "Matthew Perry", 
"Matthew Perry", "Lisa Kudrow", "Lisa Kudrow", "Lisa Kudrow", 
"Lisa Kudrow", "Lisa Kudrow"), code = c(23211L, 23211L, 23211L, 
23211L, 44215L, 44215L, 44215L, 55120L, 55120L, 55120L, 55120L, 
55120L), date = structure(c(18935, 18936, 18937, 18938, 19266, 
19269, 19270, 18993, 18994, 18996, 18998, 18999), class = "Date"), 
    usage = c(345L, 260L, 230L, 0L, 312L, 230L, 232L, 132L, 125L, 
    345L, 321L, 431L), result = c(1L, 1L, 0L, 0L, 1L, 0L, 0L, 
    0L, 0L, 1L, 1L, 1L)), row.names = c(NA, -12L), class = "data.frame")
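
Since the question asks for dplyr and pipes, the same per-person expansion can probably also be written without split() and lapply(), by grouping first and letting complete() work within each group. This is only a sketch, not the code from the answer above: df_small is a made-up subset of the question's data, and it assumes a reasonably recent tidyr in which complete() respects dplyr groups.

```r
library(dplyr)
library(tidyr)

# Hypothetical small data set (a subset of the question's groups)
df_small <- data.frame(
  name   = c("Matthew Perry", "Matthew Perry", "Lisa Kudrow", "Lisa Kudrow"),
  code   = c(44215L, 44215L, 55120L, 55120L),
  date   = as.Date(c("2022-10-01", "2022-10-04", "2022-01-01", "2022-01-03")),
  usage  = c(312L, 230L, 132L, 345L),
  result = c(1L, 0L, 0L, 1L)
)

filled <- df_small %>%
  group_by(name, code) %>%   # name and code are constant within a person
  # min(date) and max(date) are evaluated separately for each group
  complete(date = seq(min(date), max(date), by = "day")) %>%
  ungroup()

filled
```

Because name and code are grouping columns, they are carried over to the new rows automatically, so no fill argument is needed; usage and result are NA on the added dates.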

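Allan's answer is good. However, since I have had difficulties with expand() and complete() in the past, I wrote a custom function that I now use often. Maybe you will find it useful (and if you improve it, please let me know ;-)).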

See the example below, which uses the df from Allan's post:

autocomplete <- function(.data, .nestingvars, .indexvar) {
  # initialize the index that will expand the groups formed in .data by .nestingvars
  index <- data.frame(index = seq.Date(from = min(.data[[.indexvar]]),
                                       to = max(.data[[.indexvar]]),
                                       by = "day"))
  names(index)[1] <- .indexvar
  # perform a cross join to get all possible combinations of .nestingvars and .indexvar
  index <- base::merge(index, unique(.data[.nestingvars]), by = NULL)
  # merge index and data
  out <- base::merge(.data, index, by = c(.nestingvars, .indexvar), all.y = TRUE)
  return(out)
}

library(dplyr)
split(df, ~ name, drop = TRUE) %>% 
  purrr::map(.x = ., 
             .f = ~ autocomplete(.data = ., 
                                 .nestingvars = c("name", "code"), 
                                 .indexvar = "date")) %>% 
  bind_rows()
#>                name  code       date usage result
#> 1  Jennifer Aniston 23211 2021-11-04   345      1
#> 2  Jennifer Aniston 23211 2021-11-05   260      1
#> 3  Jennifer Aniston 23211 2021-11-06   230      0
#> 4  Jennifer Aniston 23211 2021-11-07     0      0
#> 5       Lisa Kudrow 55120 2022-01-01   132      0
#> 6       Lisa Kudrow 55120 2022-01-02   125      0
#> 7       Lisa Kudrow 55120 2022-01-03    NA     NA
#> 8       Lisa Kudrow 55120 2022-01-04   345      1
#> 9       Lisa Kudrow 55120 2022-01-05    NA     NA
#> 10      Lisa Kudrow 55120 2022-01-06   321      1
#> 11      Lisa Kudrow 55120 2022-01-07   431      1
#> 12    Matthew Perry 44215 2022-10-01   312      1
#> 13    Matthew Perry 44215 2022-10-02    NA     NA
#> 14    Matthew Perry 44215 2022-10-03    NA     NA
#> 15    Matthew Perry 44215 2022-10-04   230      0
#> 16    Matthew Perry 44215 2022-10-05   232      0

Created on 2022-06-02 by the reprex package (v2.0.1)
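
The cross join step inside autocomplete() may be easier to follow in isolation. The sketch below uses made-up toy inputs (dates and groups are hypothetical names, not part of the function above) to show that base::merge() with by = NULL returns every combination of rows from the two data frames.

```r
# Hypothetical toy inputs, only to illustrate the cross join step
dates  <- data.frame(date = seq(as.Date("2022-01-01"),
                                as.Date("2022-01-03"),
                                by = "day"))
groups <- data.frame(name = c("A", "B"), code = c(1L, 2L))

# With by = NULL there are no key columns, so merge() returns the
# Cartesian product: each date is paired with each (name, code) row
grid <- merge(dates, groups, by = NULL)

nrow(grid)   # 3 dates x 2 groups = 6 rows
```

In autocomplete() this grid is then merged back onto the data with all.y = TRUE, which is what leaves NA in the value columns for dates that had no row.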