Calculating a lag for a data frame in R?

Suppose I have the following dataframe/tibble in R:

tibble(
  period =
    c(
      "2010Q1",
      "2010Q2",
      "2010Q3",
      "2010Q4",
      "2010END",
      "2011Q1",
      "2011Q2",
      "2011Q3",
      "2011Q4",
      "2011END",
      "2012Q1",
      "2012Q2",
      "2012Q3",
      "2012Q4",
      "20120END",
      "2010Q1",
      "2010Q2",
      "2010Q3",
      "2010Q4",
      "2010END",
      "2011Q1",
      "2011Q2",
      "2011Q3",
      "2011Q4",
      "2011END",
      "2012Q1",
      "2012Q2",
      "2012Q3",
      "2012Q4",
      "20120END"),
  website = c(
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook"
  ), 
  values = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1)
)

How can I compute a lag against the same period of the previous year? For example, I want to calculate 2011Q1 - 2010Q1 and so on, including 2011END - 2010END.

So that I end up with a table like this:

tibble(
  period =
    c(
      "2010Q1",
      "2010Q2",
      "2010Q3",
      "2010Q4",
      "2010END",
      "2011Q1",
      "2011Q2",
      "2011Q3",
      "2011Q4",
      "2011END",
      "2012Q1",
      "2012Q2",
      "2012Q3",
      "2012Q4",
      "20120END",
      "2010Q1",
      "2010Q2",
      "2010Q3",
      "2010Q4",
      "2010END",
      "2011Q1",
      "2011Q2",
      "2011Q3",
      "2011Q4",
      "2011END",
      "2012Q1",
      "2012Q2",
      "2012Q3",
      "2012Q4",
      "20120END"),
  website = c(
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook"
  ), 
  values = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1), 
  calculation = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,14,12,10,8,6,4,2,0,-2,-4,-6,-8,-10,-12,-14))

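It makes sense that the 2010 rows are NA, since there is no earlier period to compare against; for 2011, every value is calculated as the difference from the same period in 2010.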

I ran into some problems with the lag() function when trying to group the data by the period column.

If every year is split into 5 periods, lag(..., 5) should pick up the value from 5 rows before the row being calculated.

library(dplyr)

example %>% 
  mutate(calculation = values - lag(values, 5))

Output:

# A tibble: 30 x 3
   period  values calculation
   <chr>    <dbl> <dbl>
 1 2010Q1       1    NA
 2 2010Q2       2    NA
 3 2010Q3       3    NA
 4 2010Q4       4    NA
 5 2010END      5    NA
 6 2011Q1       6     5
 7 2011Q2       7     5
 8 2011Q3       8     5
 9 2011Q4       9     5
10 2011END     10     5
# ... with 20 more rows

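Edit: as @AndrewGB rightly points out, group_by(website) must be added so that the operation is performed separately for each website. I am also assuming the rows are already ordered by period.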

example %>%
  group_by(website) %>%
  mutate(calculation = values - lag(values, 5))
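
To see why the grouping matters, here is a minimal sketch on a small made-up table. The `toy` data and its two-periods-per-year layout are assumptions for illustration only, not the asker's data, so the lag distance is 2 rather than 5.

```r
library(dplyr)

# Hypothetical toy data: two websites, two years, two periods per year.
toy <- tibble(
  period  = rep(c("2010Q1", "2010END", "2011Q1", "2011END"), times = 2),
  website = rep(c("google", "facebook"), each = 4),
  values  = c(1, 2, 6, 7, 15, 14, 10, 9)
)

# Ungrouped: lag() looks 2 rows back across the whole column, so the
# first two facebook rows are compared against google's 2011 values.
ungrouped <- toy %>%
  mutate(calculation = values - lag(values, 2))

# Grouped: lag() restarts within each website, so each website's first
# year has nothing to compare against and stays NA.
grouped <- toy %>%
  group_by(website) %>%
  mutate(calculation = values - lag(values, 2)) %>%
  ungroup()
```

In `ungrouped`, the 2010 facebook rows come out as 9 and 7 (15 - 6 and 14 - 7), which mixes the two websites. In `grouped` they are NA, as expected for a first year.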

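Here is another option using data.table, where we can use shift to subtract the matching value from the previous year (e.g. 2011Q1 - 2010Q1). I am also assuming you want to do this for each website, so I added a grouping.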

library(data.table)

setDT(dt)[, calculation :=  values - shift(values, n = 5, type = "lag"), by = website]

Output

      period  website values calculation
 1:   2010Q1   google      1          NA
 2:   2010Q2   google      2          NA
 3:   2010Q3   google      3          NA
 4:   2010Q4   google      4          NA
 5:  2010END   google      5          NA
 6:   2011Q1   google      6           5
 7:   2011Q2   google      7           5
 8:   2011Q3   google      8           5
 9:   2011Q4   google      9           5
10:  2011END   google     10           5
11:   2012Q1   google     11           5
12:   2012Q2   google     12           5
13:   2012Q3   google     13           5
14:   2012Q4   google     14           5
15: 20120END   google     15           5
16:   2010Q1 facebook     15          NA
17:   2010Q2 facebook     14          NA
18:   2010Q3 facebook     13          NA
19:   2010Q4 facebook     12          NA
20:  2010END facebook     11          NA
21:   2011Q1 facebook     10          -5
22:   2011Q2 facebook      9          -5
23:   2011Q3 facebook      8          -5
24:   2011Q4 facebook      7          -5
25:  2011END facebook      6          -5
26:   2012Q1 facebook      5          -5
27:   2012Q2 facebook      4          -5
28:   2012Q3 facebook      3          -5
29:   2012Q4 facebook      2          -5
30: 20120END facebook      1          -5
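
The fixed offset of 5 only works when every website has all five periods for every year and the rows are already sorted. If that cannot be guaranteed, one possible variation is to group by the period label as well and shift by a single row. This is only a sketch: the `year` and `part` columns are hypothetical helpers, the toy data is made up, and it assumes each label starts with a four-digit year and that no year is missing within a group.

```r
library(data.table)

# Hypothetical toy data, deliberately out of order.
dt <- data.table(
  period  = c("2011Q1", "2010END", "2010Q1", "2011END"),
  website = "google",
  values  = c(6, 5, 1, 10)
)

# Split the label into a year and a period tag (assumes a 4-digit year prefix).
dt[, year := as.integer(substr(period, 1, 4))]
dt[, part := substring(period, 5)]

# Sort so that, within each website and period tag, years are ascending.
setorder(dt, website, part, year)

# Within each (website, part) group the previous row is the previous year.
dt[, calculation := values - shift(values, n = 1, type = "lag"), by = .(website, part)]
```

Note that this reorders the table; the rows end up sorted by website, period tag, and year.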

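If you are working in a panel-regression context, dplyr::lag will not be your best option. This is because it simply takes the preceding elements of the column, in row order, as the lag.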

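I have done these lags with pivot_wider, followed by pivot_longer to rebuild the original data frame plus the lagged variable. You may also need to split the date column into separate year and quarter columns. Once that is done, you should be able to compute the correct lags. I will post the corresponding code later.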
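
As a rough sketch of the split-into-year-and-period idea, the example below matches each row to the same period of the previous year by key instead of by row position. It uses a self-join rather than the pivot_wider / pivot_longer round trip described above, so it is a swapped-in technique and not the answerer's code. The `toy` data and the column names `year`, `part`, and `previous_values` are hypothetical, and the split assumes every label is a four-digit year followed by the period tag (a label such as "20120END" in the question's data would need cleaning first).

```r
library(dplyr)

# Hypothetical toy data, deliberately out of order.
toy <- tibble(
  period  = c("2011Q1", "2010END", "2010Q1", "2011END"),
  website = "google",
  values  = c(6, 5, 1, 10)
)

# Split the label into a numeric year and a period tag ("Q1", "END", ...).
with_keys <- toy %>%
  mutate(
    year = as.integer(substr(period, 1, 4)),
    part = substring(period, 5)
  )

# Shift every row one year forward so it lines up with the following year.
previous <- with_keys %>%
  mutate(year = year + 1L) %>%
  select(website, part, year, previous_values = values)

# Join on website, period tag, and year; rows without a previous year get NA.
result <- with_keys %>%
  left_join(previous, by = c("website", "part", "year")) %>%
  mutate(calculation = values - previous_values) %>%
  select(period, website, values, calculation)
```

Because the match is made on the keys, the result does not depend on the row order, and a missing year simply yields NA rather than a comparison against the wrong period.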