How to create a lag column with a constraint in data.table?
My reproducible data is:
mydf <- structure(list(product = c("4689", "4695", "513377", "604018",
"4693", "513376", "4706", "4691", "4691", "1212", "601606", "4755",
"502659", "4679", "9934"), year = c(2018, 2018, 2018, 2018, 2019,
2019, 2019, 2019, 2019, 2019, 2021, 2021, 2021, 2021, 2021),
weeks = c(1, 2, 4, 5, 6, 7, 8, 9, 10, 8, 11, 12, 13, 14,
15), sales = c(18L, 13L, 16L, 10L, 11L, 16L, 20L, 11L, 20L,
12L, 10L, 14L, 14L, 19L, 15L)), row.names = c(NA, -15L), class = c("data.table",
"data.frame"))
I want to calculate the mean of sales from three weeks earlier, and went about it like this:
mydf[,lookup_week := weeks - 3]
lags <- mydf[,.(lag = mean(sales)),by = weeks]
joint_table <- merge(mydf, lags, by.x = 'lookup_week', by.y = 'weeks', all.x = TRUE)
It returns:
lookup_week product year weeks sales lag
1: -2 4689 2018 1 18 NA
2: -1 4695 2018 2 13 NA
3: 1 513377 2018 4 16 18
4: 2 604018 2018 5 10 13
5: 3 4693 2019 6 11 NA
6: 4 513376 2019 7 16 16
7: 5 4706 2019 8 20 10
8: 5 1212 2019 8 12 10
9: 6 4691 2019 9 11 11
10: 7 4691 2019 10 20 16
11: 8 601606 2021 11 10 16
12: 9 4755 2021 12 14 11
13: 10 502659 2021 13 14 20
14: 11 4679 2021 14 19 10
15: 12 9934 2021 15 15 14
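The problem is as follows:
For row 5, I need to look up the rows where weeks equals 3. That week doesn't exist, so I need the closest week before 3, which is 2 in this example. If that didn't exist either, I would have to go to where weeks equals 1.
The other issue is that there should be at most 1 year between the year of the lagged observation and the year of the observation I am computing the lag for. So however far back the lookup has to go, it should go back at most one year to calculate the lag.
How can I do this? dplyr answers are welcome too.
Thanks in advance.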
You can do this:
mydf[, lag3 := .SD[.(weeks-3), on=.(weeks), roll=52-3, mean(x.sales), by=.EACHI]$V1]
This gives:
> mydf[order(weeks)]
product year weeks sales lag3
1: 4689 2018 1 18 NA
2: 4695 2018 2 13 NA
3: 513377 2018 4 16 18
4: 604018 2018 5 10 13
5: 4693 2019 6 11 13
6: 513376 2019 7 16 16
7: 4706 2019 8 20 10
8: 1212 2019 8 12 10
9: 4691 2019 9 11 11
10: 4691 2019 10 20 16
11: 601606 2021 11 10 16
12: 4755 2021 12 14 11
13: 502659 2021 13 14 20
14: 4679 2021 14 19 10
15: 9934 2021 15 15 14
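How it works:
- This is a join x[i, ...] of the column i=.(weeks-3) with x=mydf.
- The join condition is: use the weeks column, going back at most 52 weeks. With roll=52-3, a lookup of weeks-3 that has no exact match falls back to the closest earlier week, up to 49 weeks before weeks-3, i.e. at most 52 weeks before the current row.
- The mean is aggregated with by=.EACHI, which groups by each row of i. x.sales is the value from x found by the join. mean(x.sales) is computed as a column named V1.
Try running mydf[.(weeks-3), on=.(weeks), roll=52-3, mean(x.sales), by=.EACHI] to see the intermediate result.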
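The one-liner above limits the lookback to 52 weeks on the weeks column but never looks at the year column. If the constraint is meant literally in terms of year, a variant along the following lines may be closer to what was asked. This is only a sketch under stated assumptions: it assumes "at most 1 year" means the matched week's year is no more than one less than the current row's year, and that each week number belongs to a single year (true in the sample data). The names weekly, matched, mean_sales and src_year are introduced here for illustration and are not part of the original answer.

```r
library(data.table)

# Same data as in the question, rebuilt with data.table() so the example is self-contained.
mydf <- data.table(
  product = c("4689", "4695", "513377", "604018", "4693", "513376", "4706",
              "4691", "4691", "1212", "601606", "4755", "502659", "4679", "9934"),
  year  = c(2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019,
            2021, 2021, 2021, 2021, 2021),
  weeks = c(1, 2, 4, 5, 6, 7, 8, 9, 10, 8, 11, 12, 13, 14, 15),
  sales = c(18L, 13L, 16L, 10L, 11L, 16L, 20L, 11L, 20L, 12L,
            10L, 14L, 14L, 19L, 15L)
)

# One row per week: mean sales plus the year that week belongs to.
# Assumption: every week number falls in a single year, as in the sample data.
weekly <- mydf[, .(mean_sales = mean(sales), src_year = max(year)), by = weeks]

# Rolling join: look up weeks - 3 and, if that week is missing, fall back to
# the closest earlier week, at most 52 - 3 weeks further back.
# Because weekly has unique weeks, the result has one row per row of mydf,
# in the same order as mydf.
matched <- weekly[mydf[, .(weeks = weeks - 3, year)], on = .(weeks), roll = 52 - 3]

# Assumed reading of the year constraint: keep the lag only when the matched
# week lies at most one calendar year before the current row.
mydf[, lag3 := fifelse(matched$year - matched$src_year <= 1,
                       matched$mean_sales, NA_real_)]

print(mydf[order(weeks)])
```

On the sample data this should differ from the output above only for weeks 11, 12 and 13: their lookups land on weeks 8, 9 and 10, which belong to 2019 while the rows themselves are from 2021, so lag3 becomes NA there. Aggregating to one row per week before joining also keeps the result aligned row for row with mydf without needing by=.EACHI.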