How to calculate a lagged difference in R when the period length varies, or how to add empty NA rows to a data frame?

Suppose I have the following data frame:

tibble(
  period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END",
             "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2011END","2012END"),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube"
  ),
  values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30)
)

and I want to find the lag of values, so that I produce the following data frame:

tibble(
  period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END",
             "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2011END","2012END"),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube"
  ),
  values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30), 
  output = c(NA, 1,NA,NA,NA,NA,NA, 5,5,5,5,5, NA, 10)
)

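The periods differ. One obvious lag is 5 (Q1, Q2, Q3, Q4, END); the second is year_priorEND vs year_aheadEND, or even further apart.

Or:

Alternatively, if a website does not have a period of 5 (meaning 5 values exist: Q1, Q2, Q3, Q4, END), generate the remaining rows for that website and period with NA values, so that something like the following can be produced: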

tibble(
  period = c("2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END"),
  website = c(
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube"
  ),
  values = c(NA,NA,NA,NA,1, NA,NA,NA,NA,2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, NA,NA,NA,NA,20, NA,NA,NA,NA,30), 
  output = c(NA,NA,NA,NA,NA,NA,NA,NA,NA, 1,NA,NA,NA,NA,NA, 5,5,5,5,5, NA,NA,NA,NA,NA,NA,NA,NA,NA, 10)
)

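So, without explicitly hard-coding which fields need to be imputed, I assume some kind of check could be done per group? In that case we could simply use output = lag(values, 5), because the periods would all be consistent.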
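The padding idea could be sketched as follows. This is only an illustration under assumptions: it relies on tidyr's complete() and full_seq(), the names year, part, parts and padded are made up for the example, and each website is padded only over its own range of years.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  period = c("2010END", "2011END",
             "2010Q1", "2010Q2", "2010Q3", "2010Q4", "2010END",
             "2011Q1", "2011Q2", "2011Q3", "2011Q4", "2011END",
             "2011END", "2012END"),
  website = c(rep("google", 2), rep("facebook", 10), rep("youtube", 2)),
  values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30)
)

# Assumed ordering of the five sub-periods within a year
parts <- c("Q1", "Q2", "Q3", "Q4", "END")

padded <- df %>%
  # Split e.g. "2010Q1" into year = 2010 and part = "Q1"
  extract(period, into = c("year", "part"),
          regex = "^(\\d{4})(.*)$", convert = TRUE) %>%
  mutate(part = factor(part, levels = parts)) %>%
  group_by(website) %>%
  # Add the missing year/part combinations per website; values become NA
  complete(year = full_seq(year, 1), part) %>%
  group_by(website) %>%
  arrange(year, part, .by_group = TRUE) %>%
  # Every year now has exactly five rows, so a constant lag of 5 is safe
  mutate(output = values - lag(values, 5)) %>%
  ungroup()

print(padded, n = Inf)
```

With a complete grid of five rows per year and website, a constant lag of 5 lines up each row with the same sub-period one year earlier; rows that had to be padded simply carry NA.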

I think the output for youtube should be 20? If so, here is a possible solution, although it is clunky:

library(tidyverse)
df <- tibble( period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END",
             "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2011END","2012END"),
             website = c( "google", "google", "facebook",
                          "facebook", "facebook", "facebook",
                          "facebook", "facebook", "facebook",
                          "facebook", "facebook", "facebook",
                          "youtube", "youtube" ),
             values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30) )

df %>% 
  extract(period, into = c("Year", "Period"), regex = "([0-9]*)(.*)") %>% 
  group_by(website) %>% 
  mutate(group_rank = match(Year, unique(Year)),
         intstep = if_else(lag(Period) == "END" & group_rank == 2, lag(values), 0)) %>% 
  group_by(website, group_rank) %>% 
  mutate(outcome = if_else(group_rank == 2, first(intstep), as.double(NA))) %>% 
  ungroup() %>% 
  select(-c(group_rank,intstep))
#> # A tibble: 14 × 5
#>    Year  Period website  values outcome
#>    <chr> <chr>  <chr>     <dbl>   <dbl>
#>  1 2010  END    google        1      NA
#>  2 2011  END    google        2       1
#>  3 2010  Q1     facebook      1      NA
#>  4 2010  Q2     facebook      2      NA
#>  5 2010  Q3     facebook      3      NA
#>  6 2010  Q4     facebook      4      NA
#>  7 2010  END    facebook      5      NA
#>  8 2011  Q1     facebook      6       5
#>  9 2011  Q2     facebook      7       5
#> 10 2011  Q3     facebook      8       5
#> 11 2011  Q4     facebook      9       5
#> 12 2011  END    facebook     10       5
#> 13 2011  END    youtube      20      NA
#> 14 2012  END    youtube      30      20

This pipeline could probably be tidier and simpler; I am open to suggestions.

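If I understand correctly, the OP wants to compute, for each website, the year-on-year difference between a period and the same period of the previous year. There is a related question in which the OP explicitly asks to compute 2011Q1 - 2010Q1 and so on, including 2011END - 2010END.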

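Using lag() only works if the time series is complete and the number of positions to lag by is always constant. That is not the case for the given dataset.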
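To illustrate with the question's data, here is a small sketch of what a naive per-website lag of 5 produces. The names df and naive are only illustrative, and dplyr is an assumption of this example.

```r
library(dplyr)

df <- tibble(
  period = c("2010END", "2011END",
             "2010Q1", "2010Q2", "2010Q3", "2010Q4", "2010END",
             "2011Q1", "2011Q2", "2011Q3", "2011Q4", "2011END",
             "2011END", "2012END"),
  website = c(rep("google", 2), rep("facebook", 10), rep("youtube", 2)),
  values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30)
)

# Naive approach: always look back five rows within each website
naive <- df %>%
  group_by(website) %>%
  mutate(output = values - lag(values, 5)) %>%
  ungroup()

print(naive, n = Inf)
```

facebook has all five sub-periods per year, so its 2011 rows are matched with the right 2010 rows. google and youtube have only two rows each, so looking back five rows never finds anything and their output stays NA throughout.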

Therefore, I suggest an update self-join:

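library(data.table)
# inp holds the question's data (columns period, website, values)
# Split period right after the leading four digits into year (integer) and qtr
setDT(inp)[, c("year", "qtr") := tstrsplit(period, "(?<=^\\d{4})", perl = TRUE, 
                                           type.convert = TRUE)][
                                             , prior_year := year - 1L]
# Update self-join: match prior_year to year for the same website and qtr
inp[inp, on = .(prior_year = year, qtr, website), output := x.values - i.values][]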
     period  website values year qtr prior_year output
 1: 2010END   google      1 2010 END       2009     NA
 2: 2011END   google      2 2011 END       2010      1
 3:  2010Q1 facebook      1 2010  Q1       2009     NA
 4:  2010Q2 facebook      2 2010  Q2       2009     NA
 5:  2010Q3 facebook      3 2010  Q3       2009     NA
 6:  2010Q4 facebook      4 2010  Q4       2009     NA
 7: 2010END facebook      5 2010 END       2009     NA
 8:  2011Q1 facebook      6 2011  Q1       2010      5
 9:  2011Q2 facebook      7 2011  Q2       2010      5
10:  2011Q3 facebook      8 2011  Q3       2010      5
11:  2011Q4 facebook      9 2011  Q4       2010      5
12: 2011END facebook     10 2011 END       2010      5
13: 2011END  youtube     20 2011 END       2010     NA
14: 2012END  youtube     30 2012 END       2011     10

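Explanation

  1. period is split into two parts, year and qtr; year is coerced to type integer. The regular expression used for splitting uses a positive look-behind (zero-length) assertion to split the string immediately after the first four digits.
  2. prior_year is computed and appended to the dataset.
  3. In the update self-join, prior_year is matched with year for the same website and qtr. The difference of the values is computed and appended as a new column, output.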
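For readers who prefer dplyr, the same three steps could be sketched with an ordinary left join. This swaps in a different technique from the data.table update join above; the helper names cur, prior_values and result are made up for the example, and tidyr's extract() stands in for tstrsplit().

```r
library(dplyr)
library(tidyr)

inp <- tibble(
  period = c("2010END", "2011END",
             "2010Q1", "2010Q2", "2010Q3", "2010Q4", "2010END",
             "2011Q1", "2011Q2", "2011Q3", "2011Q4", "2011END",
             "2011END", "2012END"),
  website = c(rep("google", 2), rep("facebook", 10), rep("youtube", 2)),
  values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30)
)

# Steps 1 and 2: split period into year and qtr, then derive prior_year
cur <- inp %>%
  extract(period, into = c("year", "qtr"),
          regex = "^(\\d{4})(.*)$", convert = TRUE, remove = FALSE) %>%
  mutate(prior_year = year - 1L)

# Step 3: join each row to the row of the same website and qtr one year earlier
result <- cur %>%
  left_join(
    cur %>% select(website, qtr, year, prior_values = values),
    by = c("website", "qtr", "prior_year" = "year")
  ) %>%
  mutate(output = values - prior_values) %>%
  select(-prior_values)

print(result, n = Inf)
```

Because rows are matched on website, qtr and year rather than on position, missing quarters do not shift the comparison; rows without a counterpart in the prior year get NA.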