如何根据 activity 的前 7 天为每个用户创建滚动平均值?

How can I create a rolling mean for each user based on previous 7 days of activity?

我一直在查看过去的帖子,但似乎找不到符合我需要的内容。 目标:对于每个用户,我想要他们前 7 天活动的平均值(不包括当前观察)。有些人在那 window 期间没有活动(这没关系),其他人会有很多活动。

我一直在使用 dplyr 按用户分组,但无法弄清楚如何获取每个时间戳并捕获该时间戳之前所有活动的平均值以获得每个人的滚动平均值。这是一个大型数据集,因此需要高效。我确信 datatable 可以做到这一点,但我发现代码很难解释,即使它更快。

User  Stamp          activity   Score    

1     2019-06-20     "Car"      4500
1     2019-06-18     "Car"      600
1     2019-06-15     "Walk"     650
1     2019-06-21     "Ride"     790
2     2019-06-21     "Car"      800    
2     2019-06-23     "Car"      500
3     2019-06-11     "Walk"     900
4     2019-06-15     "Walk"     200   
4     2019-06-12     "Walk"     900

需要变成这样。我们有每个用户基于时间戳的滚动比例和滚动均值,不包括时间戳观察。

User  Stamp          activity   Score   proportion_walk   mean_score    

1     2019-06-20     "Car"      4500    .5                 625
1     2019-06-18     "Car"      600     1                  650
1     2019-06-15     "Walk"     650     0                   0
1     2019-06-21     "Ride"     790     .33                   1916.33
2     2019-06-21     "Car"      800     0                   0
2     2019-06-23     "Car"      500     0                   800
3     2019-06-11     "Walk"     900     0                   0
4     2019-06-15     "Walk"     200     1                   900
4     2019-06-12     "Walk"     900     1                   900

使用末尾注释中的数据对指定条件执行左自连接,对所有匹配行取 Score 的平均值,如果 none.

则取 0
library(sqldf)

sqldf("select a.*, 
         coalesce(avg(b.Activity == 'Walk'), 0) as Proportion_Walk, 
         coalesce(avg(b.Score), 0) as Mean
  from DF as a 
  left join DF as b on a.User = b.User and 
                       b.Stamp < a.Stamp and b.Stamp >= a.Stamp - 7
  group by a.rowid")

给予:

  User      Stamp activity Score Proportion_Walk     Mean
1    1 2019-06-20      Car  4500       0.5000000  625.000
2    1 2019-06-18      Car   600       1.0000000  650.000
3    1 2019-06-15     Walk   650       0.0000000    0.000
4    1 2019-06-21     Ride   790       0.3333333 1916.667
5    2 2019-06-21      Car   800       0.0000000    0.000
6    2 2019-06-23      Car   500       0.0000000  800.000
7    3 2019-06-11     Walk   900       0.0000000    0.000
8    4 2019-06-15     Walk   200       1.0000000  900.000
9    4 2019-06-12     Walk   900       0.0000000    0.000

备注

可重现形式的数据:

Lines <- 'User  Stamp          activity   Score    
1     2019-06-20     "Car"      4500
1     2019-06-18     "Car"      600
1     2019-06-15     "Walk"     650
1     2019-06-21     "Ride"     790
2     2019-06-21     "Car"      800    
2     2019-06-23     "Car"      500
3     2019-06-11     "Walk"     900
4     2019-06-15     "Walk"     200   
4     2019-06-12     "Walk"     900'

DF <- read.table(text = Lines, header = TRUE)
DF$Stamp <- as.Date(DF$Stamp)

可以试试:

library(data.table)

df <- setDT(df)[, Stamp := as.Date(Stamp)][
  , `:=` (mean_score = sapply(Stamp, 
                              function(x) 
                                mean(Score[between(Stamp, x - 7, x - 1)])
  ),
  proportion_walk = sapply(Stamp, 
                           function(x) 
                             round(mean(
                               activity[between(Stamp, x - 7, x - 1)] == 'Walk'
                               ),2)
  )
  ), by = User][
    is.nan(mean_score), `:=` (mean_score = 0, proportion_walk = 0)]

输出:

   User      Stamp activity Score mean_score proportion_walk
1:    1 2019-06-20      Car  4500    625.000            0.50
2:    1 2019-06-18      Car   600    650.000            1.00
3:    1 2019-06-15     Walk   650      0.000            0.00
4:    1 2019-06-21     Ride   790   1916.667            0.33
5:    2 2019-06-21      Car   800      0.000            0.00
6:    2 2019-06-23      Car   500    800.000            0.00
7:    3 2019-06-11     Walk   900      0.000            0.00
8:    4 2019-06-15     Walk   200    900.000            1.00
9:    4 2019-06-12     Walk   900      0.000            0.00

proportion_walk,根据您的描述,我认为您的输出中有错字。否则请改写;例如,2019-06-20 不能有 0.33,因为落后 2 天,其中之一是 Walk.

library("dplyr")
library("purr")
DF %>%
  group_by(User) %>%
  mutate(mean_score = map_dbl(Stamp, 
                              ~mean(Score[(Stamp > . - 7) & (Stamp < .)]))) %>%
  mutate(mean_score =ifelse(is.nan(mean_score), 0, mean_score))

输出

  User Stamp      activity Score mean_score
  <int> <date>     <fct>    <int>      <dbl>
1     1 2019-06-20 Car       4500       625 
2     1 2019-06-18 Car        600       650 
3     1 2019-06-15 Walk       650         0 
4     1 2019-06-21 Ride       790      1917.
5     2 2019-06-21 Car        800         0 
6     2 2019-06-23 Car        500       800 
7     3 2019-06-11 Walk       900         0 
8     4 2019-06-15 Walk       200       900 
9     4 2019-06-12 Walk       900         0