如何根据 activity 的前 7 天为每个用户创建滚动平均值?
How can I create a rolling mean for each user based on previous 7 days of activity?
我一直在查看过去的帖子,但似乎找不到符合我需要的内容。
目标:对于每个用户,我想要他们前 7 天活动的平均值(不包括当前观察)。有些人在那 window 期间没有活动(这没关系),其他人会有很多活动。
我一直在使用 dplyr 按用户分组,但无法弄清楚如何获取每个时间戳并捕获该时间戳之前所有活动的平均值以获得每个人的滚动平均值。这是一个大型数据集,因此需要高效。我确信 datatable 可以做到这一点,但我发现代码很难解释,即使它更快。
User Stamp activity Score
1 2019-06-20 "Car" 4500
1 2019-06-18 "Car" 600
1 2019-06-15 "Walk" 650
1 2019-06-21 "Ride" 790
2 2019-06-21 "Car" 800
2 2019-06-23 "Car" 500
3 2019-06-11 "Walk" 900
4 2019-06-15 "Walk" 200
4 2019-06-12 "Walk" 900
需要变成这样。我们有每个用户基于时间戳的滚动比例和滚动均值,不包括时间戳观察。
User Stamp activity Score proportion_walk mean_score
1 2019-06-20 "Car" 4500 .5 625
1 2019-06-18 "Car" 600 1 650
1 2019-06-15 "Walk" 650 0 0
1 2019-06-21 "Ride" 790 .33 1916.33
2 2019-06-21 "Car" 800 0 0
2 2019-06-23 "Car" 500 0 800
3 2019-06-11 "Walk" 900 0 0
4 2019-06-15 "Walk" 200 1 900
4 2019-06-12 "Walk" 900 1 900
使用末尾注释中的数据对指定条件执行左自连接,对所有匹配行取 Score
的平均值,如果 none.
则取 0
library(sqldf)
sqldf("select a.*,
coalesce(avg(b.Activity == 'Walk'), 0) as Proportion_Walk,
coalesce(avg(b.Score), 0) as Mean
from DF as a
left join DF as b on a.User = b.User and
b.Stamp < a.Stamp and b.Stamp >= a.Stamp - 7
group by a.rowid")
给予:
User Stamp activity Score Proportion_Walk Mean
1 1 2019-06-20 Car 4500 0.5000000 625.000
2 1 2019-06-18 Car 600 1.0000000 650.000
3 1 2019-06-15 Walk 650 0.0000000 0.000
4 1 2019-06-21 Ride 790 0.3333333 1916.667
5 2 2019-06-21 Car 800 0.0000000 0.000
6 2 2019-06-23 Car 500 0.0000000 800.000
7 3 2019-06-11 Walk 900 0.0000000 0.000
8 4 2019-06-15 Walk 200 1.0000000 900.000
9 4 2019-06-12 Walk 900 0.0000000 0.000
备注
可重现形式的数据:
Lines <- 'User Stamp activity Score
1 2019-06-20 "Car" 4500
1 2019-06-18 "Car" 600
1 2019-06-15 "Walk" 650
1 2019-06-21 "Ride" 790
2 2019-06-21 "Car" 800
2 2019-06-23 "Car" 500
3 2019-06-11 "Walk" 900
4 2019-06-15 "Walk" 200
4 2019-06-12 "Walk" 900'
DF <- read.table(text = Lines, header = TRUE)
DF$Stamp <- as.Date(DF$Stamp)
可以试试:
library(data.table)
df <- setDT(df)[, Stamp := as.Date(Stamp)][
, `:=` (mean_score = sapply(Stamp,
function(x)
mean(Score[between(Stamp, x - 7, x - 1)])
),
proportion_walk = sapply(Stamp,
function(x)
round(mean(
activity[between(Stamp, x - 7, x - 1)] == 'Walk'
),2)
)
), by = User][
is.nan(mean_score), `:=` (mean_score = 0, proportion_walk = 0)]
输出:
User Stamp activity Score mean_score proportion_walk
1: 1 2019-06-20 Car 4500 625.000 0.50
2: 1 2019-06-18 Car 600 650.000 1.00
3: 1 2019-06-15 Walk 650 0.000 0.00
4: 1 2019-06-21 Ride 790 1916.667 0.33
5: 2 2019-06-21 Car 800 0.000 0.00
6: 2 2019-06-23 Car 500 800.000 0.00
7: 3 2019-06-11 Walk 900 0.000 0.00
8: 4 2019-06-15 Walk 200 900.000 1.00
9: 4 2019-06-12 Walk 900 0.000 0.00
proportion_walk
,根据您的描述,我认为您的输出中有错字。否则请改写;例如,2019-06-20
不能有 0.33,因为落后 2 天,其中之一是 Walk
.
library("dplyr")
library("purr")
DF %>%
group_by(User) %>%
mutate(mean_score = map_dbl(Stamp,
~mean(Score[(Stamp > . - 7) & (Stamp < .)]))) %>%
mutate(mean_score =ifelse(is.nan(mean_score), 0, mean_score))
输出
User Stamp activity Score mean_score
<int> <date> <fct> <int> <dbl>
1 1 2019-06-20 Car 4500 625
2 1 2019-06-18 Car 600 650
3 1 2019-06-15 Walk 650 0
4 1 2019-06-21 Ride 790 1917.
5 2 2019-06-21 Car 800 0
6 2 2019-06-23 Car 500 800
7 3 2019-06-11 Walk 900 0
8 4 2019-06-15 Walk 200 900
9 4 2019-06-12 Walk 900 0
我一直在查看过去的帖子,但似乎找不到符合我需要的内容。 目标:对于每个用户,我想要他们前 7 天活动的平均值(不包括当前观察)。有些人在那 window 期间没有活动(这没关系),其他人会有很多活动。
我一直在使用 dplyr 按用户分组,但无法弄清楚如何获取每个时间戳并捕获该时间戳之前所有活动的平均值以获得每个人的滚动平均值。这是一个大型数据集,因此需要高效。我确信 datatable 可以做到这一点,但我发现代码很难解释,即使它更快。
User Stamp activity Score
1 2019-06-20 "Car" 4500
1 2019-06-18 "Car" 600
1 2019-06-15 "Walk" 650
1 2019-06-21 "Ride" 790
2 2019-06-21 "Car" 800
2 2019-06-23 "Car" 500
3 2019-06-11 "Walk" 900
4 2019-06-15 "Walk" 200
4 2019-06-12 "Walk" 900
需要变成这样。我们有每个用户基于时间戳的滚动比例和滚动均值,不包括时间戳观察。
User Stamp activity Score proportion_walk mean_score
1 2019-06-20 "Car" 4500 .5 625
1 2019-06-18 "Car" 600 1 650
1 2019-06-15 "Walk" 650 0 0
1 2019-06-21 "Ride" 790 .33 1916.33
2 2019-06-21 "Car" 800 0 0
2 2019-06-23 "Car" 500 0 800
3 2019-06-11 "Walk" 900 0 0
4 2019-06-15 "Walk" 200 1 900
4 2019-06-12 "Walk" 900 1 900
使用末尾注释中的数据对指定条件执行左自连接,对所有匹配行取 Score
的平均值,如果 none.
library(sqldf)
sqldf("select a.*,
coalesce(avg(b.Activity == 'Walk'), 0) as Proportion_Walk,
coalesce(avg(b.Score), 0) as Mean
from DF as a
left join DF as b on a.User = b.User and
b.Stamp < a.Stamp and b.Stamp >= a.Stamp - 7
group by a.rowid")
给予:
User Stamp activity Score Proportion_Walk Mean
1 1 2019-06-20 Car 4500 0.5000000 625.000
2 1 2019-06-18 Car 600 1.0000000 650.000
3 1 2019-06-15 Walk 650 0.0000000 0.000
4 1 2019-06-21 Ride 790 0.3333333 1916.667
5 2 2019-06-21 Car 800 0.0000000 0.000
6 2 2019-06-23 Car 500 0.0000000 800.000
7 3 2019-06-11 Walk 900 0.0000000 0.000
8 4 2019-06-15 Walk 200 1.0000000 900.000
9 4 2019-06-12 Walk 900 0.0000000 0.000
备注
可重现形式的数据:
Lines <- 'User Stamp activity Score
1 2019-06-20 "Car" 4500
1 2019-06-18 "Car" 600
1 2019-06-15 "Walk" 650
1 2019-06-21 "Ride" 790
2 2019-06-21 "Car" 800
2 2019-06-23 "Car" 500
3 2019-06-11 "Walk" 900
4 2019-06-15 "Walk" 200
4 2019-06-12 "Walk" 900'
DF <- read.table(text = Lines, header = TRUE)
DF$Stamp <- as.Date(DF$Stamp)
可以试试:
library(data.table)
df <- setDT(df)[, Stamp := as.Date(Stamp)][
, `:=` (mean_score = sapply(Stamp,
function(x)
mean(Score[between(Stamp, x - 7, x - 1)])
),
proportion_walk = sapply(Stamp,
function(x)
round(mean(
activity[between(Stamp, x - 7, x - 1)] == 'Walk'
),2)
)
), by = User][
is.nan(mean_score), `:=` (mean_score = 0, proportion_walk = 0)]
输出:
User Stamp activity Score mean_score proportion_walk
1: 1 2019-06-20 Car 4500 625.000 0.50
2: 1 2019-06-18 Car 600 650.000 1.00
3: 1 2019-06-15 Walk 650 0.000 0.00
4: 1 2019-06-21 Ride 790 1916.667 0.33
5: 2 2019-06-21 Car 800 0.000 0.00
6: 2 2019-06-23 Car 500 800.000 0.00
7: 3 2019-06-11 Walk 900 0.000 0.00
8: 4 2019-06-15 Walk 200 900.000 1.00
9: 4 2019-06-12 Walk 900 0.000 0.00
proportion_walk
,根据您的描述,我认为您的输出中有错字。否则请改写;例如,2019-06-20
不能有 0.33,因为落后 2 天,其中之一是 Walk
.
library("dplyr")
library("purr")
DF %>%
group_by(User) %>%
mutate(mean_score = map_dbl(Stamp,
~mean(Score[(Stamp > . - 7) & (Stamp < .)]))) %>%
mutate(mean_score =ifelse(is.nan(mean_score), 0, mean_score))
输出
User Stamp activity Score mean_score
<int> <date> <fct> <int> <dbl>
1 1 2019-06-20 Car 4500 625
2 1 2019-06-18 Car 600 650
3 1 2019-06-15 Walk 650 0
4 1 2019-06-21 Ride 790 1917.
5 2 2019-06-21 Car 800 0
6 2 2019-06-23 Car 500 800
7 3 2019-06-11 Walk 900 0
8 4 2019-06-15 Walk 200 900
9 4 2019-06-12 Walk 900 0