How to create a lag column with a constraint in data.table?

My reproducible data is:

mydf <- structure(list(product = c("4689", "4695", "513377", "604018", 
"4693", "513376", "4706", "4691", "4691", "1212", "601606", "4755", 
"502659", "4679", "9934"), year = c(2018, 2018, 2018, 2018, 2019, 
2019, 2019, 2019, 2019, 2019, 2021, 2021, 2021, 2021, 2021), 
    weeks = c(1, 2, 4, 5, 6, 7, 8, 9, 10, 8, 11, 12, 13, 14, 
    15), sales = c(18L, 13L, 16L, 10L, 11L, 16L, 20L, 11L, 20L, 
    12L, 10L, 14L, 14L, 19L, 15L)), row.names = c(NA, -15L), class = c("data.table", 
"data.frame"))

I want to calculate the mean of sales from three weeks earlier, and I am going about it like this:

mydf[,lookup_week := weeks - 3]

lags <- mydf[,.(lag = mean(sales)),by = weeks]

joint_table <- merge(mydf,lags,by.x = 'lookup_week',by.y = 'weeks',all.x = TRUE)

It returns:

    lookup_week product year weeks sales lag
 1:          -2    4689 2018     1    18  NA
 2:          -1    4695 2018     2    13  NA
 3:           1  513377 2018     4    16  18
 4:           2  604018 2018     5    10  13
 5:           3    4693 2019     6    11  NA
 6:           4  513376 2019     7    16  16
 7:           5    4706 2019     8    20  10
 8:           5    1212 2019     8    12  10
 9:           6    4691 2019     9    11  11
10:           7    4691 2019    10    20  16
11:           8  601606 2021    11    10  16
12:           9    4755 2021    12    14  11
13:          10  502659 2021    13    14  20
14:          11    4679 2021    14    19  10
15:          12    9934 2021    15    15  14

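The problems are as follows:

For row 5, I need to look at the row where weeks equals 3. It doesn't exist, so I need the closest week before 3, which should be 2 in this example. If that didn't exist either, I would have to go to the row where weeks equals 1.

The other problem is that there should be at most one year between the year of the lagged observation and the year of the observation I am computing the lag for. So, to limit how far back the lookup goes, it should go back at most one year to compute the lag.

How can I do this?

dplyr answers are also welcome.

Thanks in advance.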

You can do it like this:

mydf[, lag3 := .SD[.(weeks-3), on=.(weeks), roll=52-3, mean(x.sales), by=.EACHI]$V1]

This gives

> mydf[order(weeks)]
    product year weeks sales lag3
 1:    4689 2018     1    18   NA
 2:    4695 2018     2    13   NA
 3:  513377 2018     4    16   18
 4:  604018 2018     5    10   13
 5:    4693 2019     6    11   13
 6:  513376 2019     7    16   16
 7:    4706 2019     8    20   10
 8:    1212 2019     8    12   10
 9:    4691 2019     9    11   11
10:    4691 2019    10    20   16
11:  601606 2021    11    10   16
12:    4755 2021    12    14   11
13:  502659 2021    13    14   20
14:    4679 2021    14    19   10
15:    9934 2021    15    15   14

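How it works:

  • This is a join of the form x[i, ...], with i = .(weeks-3) and x = mydf (referred to as .SD inside j).
  • The join is on the weeks column; roll=52-3 lets a lookup with no exact match fall back to the nearest earlier week, so a match is at most 52 weeks before the current row.
  • by=.EACHI groups the aggregation by each row of i.
  • x.sales is the value of sales from x found by the join.
  • mean(x.sales) is returned as a column named V1.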
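To see the rolling behaviour in isolation, here is a minimal sketch on a made-up three-row table. The toy table x and the output column name lag are illustrative only and are not from the question. Week 3 is missing, so the lookup for 3 should fall back to week 2, while a lookup far beyond the last observation should get no match.

```r
library(data.table)

# Toy table (hypothetical): week 3 is deliberately missing
x <- data.table(weeks = c(1, 2, 4), sales = c(18L, 13L, 16L))

# Look up weeks 3, 1 and 60.
# roll = 49 carries the nearest earlier observation forward by at most 49 weeks.
res <- x[.(c(3, 1, 60)), on = .(weeks), roll = 49,
         .(lag = mean(x.sales)), by = .EACHI]

print(res)
# lag is 13 for week 3 (rolled back to week 2), 18 for week 1 (exact match),
# and NA for week 60 (the nearest earlier week, 4, is more than 49 weeks away)
```

The same mechanics apply in the one-liner above, where i is built from weeks-3 for every row of mydf.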

Try running mydf[.(weeks-3), on=.(weeks), roll=52-3, mean(x.sales), by=.EACHI] on its own to see the intermediate result.
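The roll limit above is expressed in weeks. If the one-year limit from the question should instead be checked against the year column, one possible sketch is to do the rolling join first and then discard matches that are too old. This is an assumption about what the asker means, not part of the answer above: the rule "matched year is at most one less than the current year" and the names probe, hits, row_id and cur_year are all illustrative.

```r
library(data.table)

# The question's data, rebuilt with data.table() so the block is self-contained
mydf <- data.table(
  product = c("4689", "4695", "513377", "604018", "4693", "513376", "4706",
              "4691", "4691", "1212", "601606", "4755", "502659", "4679", "9934"),
  year  = c(2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019,
            2021, 2021, 2021, 2021, 2021),
  weeks = c(1, 2, 4, 5, 6, 7, 8, 9, 10, 8, 11, 12, 13, 14, 15),
  sales = c(18L, 13L, 16L, 10L, 11L, 16L, 20L, 11L, 20L, 12L,
            10L, 14L, 14L, 19L, 15L)
)

# One probe row per original row: the week to look up and the current year
probe <- mydf[, .(row_id = .I, cur_year = year, weeks = weeks - 3)]

# Rolling join: the exact week if present, otherwise the nearest earlier week
hits <- mydf[probe, on = .(weeks), roll = Inf]

# Assumed rule: keep a match only if it is at most one calendar year older
lag_by_row <- hits[, {
  ok <- !is.na(year) & (cur_year - year) <= 1
  list(lag3 = if (any(ok)) mean(sales[ok]) else NA_real_)
}, by = row_id]

# Write the result back in the original row order
mydf[, lag3 := lag_by_row[order(row_id), lag3]]
print(mydf)
```

On the question's data this differs from the week-based version only for the 2021 rows with weeks 11 to 13: their matches come from 2019, so under this assumed rule they become NA.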