在数据框 R 中的日期时间字段上滑动 window
Sliding window on date-time field in data frame R
我想计算一些描述性统计数据:
- 按特定客户(例如,
customer_id
字段)
- 滑动时间window(例如,24 小时)
- 不包括当前值(仅使用 期间的先前值)
数据框示例:
dt <- tibble(
order_id = 1:10,
customer_id = c(1, rep(2, 2), rep(3, 3), rep(4, 4)),
amount = seq(10, 100, by = 10),
date = c("2020-10-07 12:00", # 1st customer
"2020-10-07 12:00", "2020-10-08 11:00", # 2st customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", # 3rd customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", "2020-10-08 21:00") # 4th customer
) %>%
mutate(
date = lubridate::ymd_hm(date)
)
预期结果:
order_id customer_id amount date amount__n__last_24h amount__mean__last_24h
<int> <dbl> <dbl> <dttm> <dbl> <dbl>
1 1 10 2020-10-07 12:00:00 0 NA
2 2 20 2020-10-07 12:00:00 0 NA
3 2 30 2020-10-08 11:00:00 1 20
4 3 40 2020-10-07 12:00:00 0 NA
5 3 50 2020-10-08 11:00:00 1 40
6 3 60 2020-10-08 20:00:00 1 50
7 4 70 2020-10-07 12:00:00 0 NA
8 4 80 2020-10-08 11:00:00 1 70
9 4 90 2020-10-08 20:00:00 1 80
10 4 100 2020-10-08 21:00:00 2 85
我该怎么做?
这是一个使用 data.table
的选项:
dt[, dayago := date - 24 * 60 * 60]
dt[, c("n", "avg") :=
dt[dt, on=.(customer_id, date>=dayago, date<date),
by=.EACHI, .(n=.N, avg=mean(amount))][, (1L:3L) := NULL]
]
数据:
library(data.table)
dt <- data.table(
order_id = 1:10,
customer_id = c(1, rep(2, 2), rep(3, 3), rep(4, 4)),
amount = seq(10, 100, by = 10),
date = as.POSIXct(c("2020-10-07 12:00", # 1st customer
"2020-10-07 12:00", "2020-10-08 11:00", # 2st customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", # 3rd customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", "2020-10-08 21:00" # 4th customer
), format=("%Y-%m-%d %H:%M")))
我想计算一些描述性统计数据:
- 按特定客户(例如,
customer_id
字段) - 滑动时间window(例如,24 小时)
- 不包括当前值(仅使用 期间的先前值)
数据框示例:
dt <- tibble(
order_id = 1:10,
customer_id = c(1, rep(2, 2), rep(3, 3), rep(4, 4)),
amount = seq(10, 100, by = 10),
date = c("2020-10-07 12:00", # 1st customer
"2020-10-07 12:00", "2020-10-08 11:00", # 2st customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", # 3rd customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", "2020-10-08 21:00") # 4th customer
) %>%
mutate(
date = lubridate::ymd_hm(date)
)
预期结果:
order_id customer_id amount date amount__n__last_24h amount__mean__last_24h
<int> <dbl> <dbl> <dttm> <dbl> <dbl>
1 1 10 2020-10-07 12:00:00 0 NA
2 2 20 2020-10-07 12:00:00 0 NA
3 2 30 2020-10-08 11:00:00 1 20
4 3 40 2020-10-07 12:00:00 0 NA
5 3 50 2020-10-08 11:00:00 1 40
6 3 60 2020-10-08 20:00:00 1 50
7 4 70 2020-10-07 12:00:00 0 NA
8 4 80 2020-10-08 11:00:00 1 70
9 4 90 2020-10-08 20:00:00 1 80
10 4 100 2020-10-08 21:00:00 2 85
我该怎么做?
这是一个使用 data.table
的选项:
dt[, dayago := date - 24 * 60 * 60]
dt[, c("n", "avg") :=
dt[dt, on=.(customer_id, date>=dayago, date<date),
by=.EACHI, .(n=.N, avg=mean(amount))][, (1L:3L) := NULL]
]
数据:
library(data.table)
dt <- data.table(
order_id = 1:10,
customer_id = c(1, rep(2, 2), rep(3, 3), rep(4, 4)),
amount = seq(10, 100, by = 10),
date = as.POSIXct(c("2020-10-07 12:00", # 1st customer
"2020-10-07 12:00", "2020-10-08 11:00", # 2st customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", # 3rd customer
"2020-10-07 12:00", "2020-10-08 11:00", "2020-10-08 20:00", "2020-10-08 21:00" # 4th customer
), format=("%Y-%m-%d %H:%M")))