What function can I use to complete and fill missing time series observations, avoiding completion before the series start date?

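I have a long-format time series data frame grouped by id. The series have different start dates and also have missing observations. I want to fill in the missing observations by completing date and id and setting the new values to 0.

What I want to avoid in the process is completing observations that are missing at the start of a series, because a gap there only indicates that the series has a later starting point (e.g. a different product launch date).

In my reprex I used complete from tidyr. It does the opposite of what I want. Instead of completing id "A1" with "2015-01-04", it completes id "B1" with "2015-01-01", which is not wanted in this case. Does complete always create groups of the same size? Maybe it is the wrong function.

How can I achieve the opposite in the example below?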

library(tidyr)

data <- data.frame (id = as.character(c(rep("A1",6),rep("B1",5))),
                    value = c(seq( 1, 9, length.out = 11)),
                    date = as.Date(c(c("2015-01-01","2015-01-02","2015-01-03",
                                         "2015-01-05","2015-01-06","2015-01-07"),
                                       c("2015-01-02","2015-01-03","2015-01-05",
                                         "2015-01-06","2015-01-07")
                                      )
                    )
)

data %>% complete(date, id, fill = list(value = 0)) 

This is easiest to express by making the data rectangular first. You can then re-introduce the missingness as follows:

data %>%
  tidyr::complete(date, id, fill = list(value = 0)) %>%
  dplyr::group_by(id) %>%
  dplyr::arrange(date, .by_group = TRUE) %>%   # sort by date within each id
  dplyr::filter(!dplyr::cumall(value == 0)) %>%  # drop zeros that come before the first non-zero row of each id
  dplyr::ungroup()
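One thing to keep in mind: `complete(date, id)` only combines dates that already occur somewhere in the data, so a day that is missing for every id (2015-01-04 in the question's data) is not created. A minimal sketch of a variation that also fills such days, assuming the series are daily and that a genuine observation is never 0 before the first non-zero value, uses `tidyr::full_seq()` to build the full date range. The name `filled` is introduced here only for illustration.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
data <- data.frame(
  id = c(rep("A1", 6), rep("B1", 5)),
  value = seq(1, 9, length.out = 11),
  date = as.Date(c(
    "2015-01-01", "2015-01-02", "2015-01-03",
    "2015-01-05", "2015-01-06", "2015-01-07",
    "2015-01-02", "2015-01-03", "2015-01-05",
    "2015-01-06", "2015-01-07"
  ))
)

filled <- data %>%
  # full_seq() expands the observed dates to every day between min and max
  complete(date = full_seq(date, period = 1), id, fill = list(value = 0)) %>%
  group_by(id) %>%
  arrange(date, .by_group = TRUE) %>%
  # cumall() stays TRUE only while every value so far is 0,
  # so this removes the leading run of zeros in each id
  filter(!cumall(value == 0)) %>%
  ungroup()

filled
```

With the question's data this keeps the zero rows for 2015-01-04 in both ids and drops the B1 row for 2015-01-01.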

You need to explicitly supply the dates to fill in:

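library(dplyr)  # for group_by(); tidyr is already loaded in the question

data %>%
  group_by(id) %>%
  complete(date = seq(min(date), max(date), by = 1), fill = list(value = 0)) %>%  # each id's own date range
  ungroup()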
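Because `complete()` runs on a grouped data frame here, `seq(min(date), max(date), by = 1)` is evaluated separately for each id, so every series starts at its own first date and also stops at its own last date. If the series should instead all run up to a common end date, a minimal sketch could look like the following. The shared end date is an assumption that the question does not state, and `last_date` and `filled_to_end` are names made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
data <- data.frame(
  id = c(rep("A1", 6), rep("B1", 5)),
  value = seq(1, 9, length.out = 11),
  date = as.Date(c(
    "2015-01-01", "2015-01-02", "2015-01-03",
    "2015-01-05", "2015-01-06", "2015-01-07",
    "2015-01-02", "2015-01-03", "2015-01-05",
    "2015-01-06", "2015-01-07"
  ))
)

# Assumed common end date: the latest date seen in any series
last_date <- max(data$date)

filled_to_end <- data %>%
  group_by(id) %>%
  # start at each id's own first date, end at the shared last date
  complete(date = seq(min(date), last_date, by = "day"), fill = list(value = 0)) %>%
  ungroup()

filled_to_end
```

In the question's data both series already end on 2015-01-07, so this gives the same rows as the per-group `max(date)` version. The two only differ when a series stops early.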

Another option is to complete all dates for each id:

library(tidyverse)

data <- data.frame(
  id = as.character(c(rep("A1", 6), rep("B1", 5))),
  value = c(seq(1, 9, length.out = 11)),
  date = as.Date(c(
    c(
      "2015-01-01", "2015-01-02", "2015-01-03",
      "2015-01-05", "2015-01-06", "2015-01-07"
    ),
    c(
      "2015-01-02", "2015-01-03", "2015-01-05",
      "2015-01-06", "2015-01-07"
    )
  ))
)


all_dates <- seq(min(data$date), max(data$date), by = "day") %>% as.character()

# complete all dates for each id
data %>%
  as_tibble() %>%
  group_by(id) %>%
  mutate(date = date %>% as.character() %>% factor(levels = all_dates)) %>%
  complete(date, fill = list(value = 0)) %>%
  mutate(date = date %>% as.Date())
#> # A tibble: 14 × 3
#> # Groups:   id [2]
#>    id    date       value
#>    <chr> <date>     <dbl>
#>  1 A1    2015-01-01   1  
#>  2 A1    2015-01-02   1.8
#>  3 A1    2015-01-03   2.6
#>  4 A1    2015-01-04   0  
#>  5 A1    2015-01-05   3.4
#>  6 A1    2015-01-06   4.2
#>  7 A1    2015-01-07   5  
#>  8 B1    2015-01-01   0  
#>  9 B1    2015-01-02   5.8
#> 10 B1    2015-01-03   6.6
#> 11 B1    2015-01-04   0  
#> 12 B1    2015-01-05   7.4
#> 13 B1    2015-01-06   8.2
#> 14 B1    2015-01-07   9

Created on 2022-04-01 by the reprex package (v2.0.0)
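The output above still contains a zero row for B1 on 2015-01-01, which is the row the question wants to avoid. One possible way to trim it, sketched here under the assumption that a series begins at its first observed date, is to look up each id's start date and keep only rows on or after it. The names `starts`, `start` and `trimmed` are illustrative and not part of the reprex above.

```r
library(dplyr)
library(tidyr)

# Same data as in the reprex above
data <- data.frame(
  id = c(rep("A1", 6), rep("B1", 5)),
  value = seq(1, 9, length.out = 11),
  date = as.Date(c(
    "2015-01-01", "2015-01-02", "2015-01-03",
    "2015-01-05", "2015-01-06", "2015-01-07",
    "2015-01-02", "2015-01-03", "2015-01-05",
    "2015-01-06", "2015-01-07"
  ))
)

# First observed date of each id
starts <- data %>%
  group_by(id) %>%
  summarise(start = min(date), .groups = "drop")

trimmed <- data %>%
  # every id gets every day between the overall first and last date
  complete(id, date = seq(min(date), max(date), by = "day"), fill = list(value = 0)) %>%
  left_join(starts, by = "id") %>%
  # remove the rows that were created before the series started
  filter(date >= start) %>%
  select(-start)

trimmed
```

Compared with filtering on leading zeros, this keeps a genuine observation of 0 at the start of a series, because the cut-off comes from the dates rather than from the values.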

This is not very elegant, but it works.

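library(dplyr)
library(tidyr)

# `dates` and `ids` are not defined in the snippet; assumed here to be all days and all ids in the data
dates <- seq(min(data$date), max(data$date), by = "day")
ids <- unique(data$id)

data.frame(date = rep(dates, length(ids)),
           id = rep(ids, each = length(dates))) |> 
        full_join(data, by = c("date", "id")) |>
        arrange(id, date) |>
        group_by(id) |>
        filter(cumany(!is.na(value))) |>   # drop rows before each id's first observation
        mutate(value = replace_na(value, 0)) |> 
        ungroup()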