Using R data.table to calculate strike rates over cumulative dates
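I have a data.table of roughly 1.5 million rows and hundreds of columns, representing dates with horse racing results. It will be used for a predictive model, but first it needs feature engineering to calculate strike rates for various entities, based on their prior record up to the day before each race.
"Strike rate" can be defined in many ways, but a simple one is the ratio of wins to runs for any given horse, trainer, jockey, etc. Naturally, this has to take into account all previous runs and wins but exclude the results from "today", since including them would be nonsense for building a model.
Regardless, a simple data structure, adapted from some online examples, will suffice to explain.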
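As a minimal sketch of that definition for a single entity with exactly one run per date (the `finish` vector below is made up for illustration, not taken from the data), the "before today" strike rate can be built from running totals that leave out the current row:

```r
# Hypothetical finishing positions for one trainer, oldest race first,
# assuming exactly one run per date
finish <- c(1, 3, 1, 2, 5)

win <- finish == 1
prior_wins <- cumsum(win) - win    # wins strictly before each run
prior_runs <- seq_along(win) - 1   # runs strictly before each run

# 0/0 gives NaN for the first run, which has no history yet
strike_rate <- prior_wins / prior_runs
print(strike_rate)
```

With several runs on the same date this shortcut is not enough, because earlier rows of the same day would leak into later ones.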
Generate the data as follows:
library(data.table)

n <- 90
dt <- data.table(
  date = rep(seq(as.Date('2010-01-01'), as.Date('2015-01-01'), by = 'year'), n/6),
  finish = c(1:5),
  trainer = sort(rep(letters[1:5], n/5))
)
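Imagine that on these dates each trainer has a runner whose finishing position in its race is given by "finish". For a new date in the sequence (but not in this data), the ratio of wins so far can be calculated like so: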
dt[order(trainer, date), .(strike_rate = sum(finish==1)/.N), by=trainer]
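However, the resulting `strike_rate` for each trainer is only valid for a new date in the sequence that is not in this data set, say '2015-01-02', or for our out-of-sample set.
To build the model, we need the strike rates in place for each day and each trainer (and many other entities, but let's stick with trainer for now).
I have played around with the `shift` function and data.table constructs but can't get them to work for this particular problem. In a looping context it works fine, but it is incredibly slow.
To illustrate the required output, this example code (though I'm sure it's not elegant!) works okay: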
# order dates most recent to oldest so that the loop works backwards in time:
dt <- dt[order(-date)]
# find unique dates (converting to character as something weird with date)
dates = as.character(unique(dt$date))
for (d in dates) {
  # find unique trainers on this date
  trainers = unique(dt$trainer[dt$date==d])
  for (t in trainers) {
    trainer_past_form = dt[trainer==t & date < d]
    strike_rate = sum(trainer_past_form$finish==1)/nrow(trainer_past_form)
    # save this strike rate for this day and this trainer
    dt$strike_rate[dt$trainer==t & dt$date==d] <- strike_rate
  }
}
This gives the desired output:
date finish trainer strike_rate
1: 2015-01-01 1 a 0.2000000
2: 2015-01-01 2 a 0.2000000
3: 2015-01-01 3 a 0.2000000
4: 2015-01-01 4 b 0.2000000
5: 2015-01-01 5 b 0.2000000
6: 2015-01-01 1 b 0.2000000
7: 2015-01-01 2 c 0.2000000
8: 2015-01-01 3 c 0.2000000
9: 2015-01-01 4 c 0.2000000
10: 2015-01-01 5 d 0.2000000
11: 2015-01-01 1 d 0.2000000
12: 2015-01-01 2 d 0.2000000
13: 2015-01-01 3 e 0.2000000
14: 2015-01-01 4 e 0.2000000
15: 2015-01-01 5 e 0.2000000
16: 2014-01-01 5 a 0.1666667
17: 2014-01-01 1 a 0.1666667
18: 2014-01-01 2 a 0.1666667
19: 2014-01-01 3 b 0.2500000
20: 2014-01-01 4 b 0.2500000
21: 2014-01-01 5 b 0.2500000
22: 2014-01-01 1 c 0.1666667
23: 2014-01-01 2 c 0.1666667
24: 2014-01-01 3 c 0.1666667
25: 2014-01-01 4 d 0.1666667
26: 2014-01-01 5 d 0.1666667
27: 2014-01-01 1 d 0.1666667
28: 2014-01-01 2 e 0.2500000
29: 2014-01-01 3 e 0.2500000
30: 2014-01-01 4 e 0.2500000
31: 2013-01-01 4 a 0.1111111
32: 2013-01-01 5 a 0.1111111
33: 2013-01-01 1 a 0.1111111
34: 2013-01-01 2 b 0.3333333
35: 2013-01-01 3 b 0.3333333
36: 2013-01-01 4 b 0.3333333
37: 2013-01-01 5 c 0.1111111
38: 2013-01-01 1 c 0.1111111
39: 2013-01-01 2 c 0.1111111
40: 2013-01-01 3 d 0.2222222
41: 2013-01-01 4 d 0.2222222
42: 2013-01-01 5 d 0.2222222
43: 2013-01-01 1 e 0.2222222
44: 2013-01-01 2 e 0.2222222
45: 2013-01-01 3 e 0.2222222
46: 2012-01-01 3 a 0.1666667
47: 2012-01-01 4 a 0.1666667
48: 2012-01-01 5 a 0.1666667
49: 2012-01-01 1 b 0.3333333
50: 2012-01-01 2 b 0.3333333
51: 2012-01-01 3 b 0.3333333
52: 2012-01-01 4 c 0.0000000
53: 2012-01-01 5 c 0.0000000
54: 2012-01-01 1 c 0.0000000
55: 2012-01-01 2 d 0.3333333
56: 2012-01-01 3 d 0.3333333
57: 2012-01-01 4 d 0.3333333
58: 2012-01-01 5 e 0.1666667
59: 2012-01-01 1 e 0.1666667
60: 2012-01-01 2 e 0.1666667
61: 2011-01-01 2 a 0.3333333
62: 2011-01-01 3 a 0.3333333
63: 2011-01-01 4 a 0.3333333
64: 2011-01-01 5 b 0.3333333
65: 2011-01-01 1 b 0.3333333
66: 2011-01-01 2 b 0.3333333
67: 2011-01-01 3 c 0.0000000
68: 2011-01-01 4 c 0.0000000
69: 2011-01-01 5 c 0.0000000
70: 2011-01-01 1 d 0.3333333
71: 2011-01-01 2 d 0.3333333
72: 2011-01-01 3 d 0.3333333
73: 2011-01-01 4 e 0.0000000
74: 2011-01-01 5 e 0.0000000
75: 2011-01-01 1 e 0.0000000
76: 2010-01-01 1 a NaN
77: 2010-01-01 2 a NaN
78: 2010-01-01 3 a NaN
79: 2010-01-01 4 b NaN
80: 2010-01-01 5 b NaN
81: 2010-01-01 1 b NaN
82: 2010-01-01 2 c NaN
83: 2010-01-01 3 c NaN
84: 2010-01-01 4 c NaN
85: 2010-01-01 5 d NaN
86: 2010-01-01 1 d NaN
87: 2010-01-01 2 d NaN
88: 2010-01-01 3 e NaN
89: 2010-01-01 4 e NaN
90: 2010-01-01 5 e NaN
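Any help doing this "properly" in data.table would be much appreciated. As can be seen, I have started using the library but have hit a roadblock with this type of problem. I understand the logic of the loop, but it is not efficient on 1.5M rows when lots of calculations of this type are needed over all variables.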
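I don't think a `for` loop is needed. I'm using `magrittr::%>%` here mostly because I think it helps break out the flow of operations; it is not required, and this can easily be converted into data.table-pipes or whatever you prefer.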
library(data.table)
library(magrittr)
dt %>%
  .[ order(date), ] %>%
  .[, c("rate", "n") := .(cumsum(finish == 1), seq_len(.N)), by = .(trainer) ] %>%
  .[, .(rate = max(rate) / max(n)), by = .(date, trainer) ] %>%
  .[, date := shift(date, type = "lead"), by = .(trainer) ] %>%
  merge(dt, ., by = c("trainer", "date"), all.x = TRUE) %>%
  .[ order(-date), ]
# trainer date finish rate
# 1: a 2015-01-01 1 0.2000000
# 2: a 2015-01-01 2 0.2000000
# 3: a 2015-01-01 3 0.2000000
# 4: b 2015-01-01 4 0.2000000
# 5: b 2015-01-01 5 0.2000000
# 6: b 2015-01-01 1 0.2000000
# 7: c 2015-01-01 2 0.2000000
# 8: c 2015-01-01 3 0.2000000
# 9: c 2015-01-01 4 0.2000000
# 10: d 2015-01-01 5 0.2000000
# 11: d 2015-01-01 1 0.2000000
# 12: d 2015-01-01 2 0.2000000
# 13: e 2015-01-01 3 0.2000000
# 14: e 2015-01-01 4 0.2000000
# 15: e 2015-01-01 5 0.2000000
# 16: a 2014-01-01 5 0.1666667
# 17: a 2014-01-01 1 0.1666667
# 18: a 2014-01-01 2 0.1666667
# 19: b 2014-01-01 3 0.2500000
# 20: b 2014-01-01 4 0.2500000
# 21: b 2014-01-01 5 0.2500000
# 22: c 2014-01-01 1 0.1666667
# 23: c 2014-01-01 2 0.1666667
# 24: c 2014-01-01 3 0.1666667
# 25: d 2014-01-01 4 0.1666667
# 26: d 2014-01-01 5 0.1666667
# 27: d 2014-01-01 1 0.1666667
# 28: e 2014-01-01 2 0.2500000
# 29: e 2014-01-01 3 0.2500000
# 30: e 2014-01-01 4 0.2500000
# 31: a 2013-01-01 4 0.1111111
# 32: a 2013-01-01 5 0.1111111
# 33: a 2013-01-01 1 0.1111111
# 34: b 2013-01-01 2 0.3333333
# 35: b 2013-01-01 3 0.3333333
# 36: b 2013-01-01 4 0.3333333
# 37: c 2013-01-01 5 0.1111111
# 38: c 2013-01-01 1 0.1111111
# 39: c 2013-01-01 2 0.1111111
# 40: d 2013-01-01 3 0.2222222
# 41: d 2013-01-01 4 0.2222222
# 42: d 2013-01-01 5 0.2222222
# 43: e 2013-01-01 1 0.2222222
# 44: e 2013-01-01 2 0.2222222
# 45: e 2013-01-01 3 0.2222222
# 46: a 2012-01-01 3 0.1666667
# 47: a 2012-01-01 4 0.1666667
# 48: a 2012-01-01 5 0.1666667
# 49: b 2012-01-01 1 0.3333333
# 50: b 2012-01-01 2 0.3333333
# 51: b 2012-01-01 3 0.3333333
# 52: c 2012-01-01 4 0.0000000
# 53: c 2012-01-01 5 0.0000000
# 54: c 2012-01-01 1 0.0000000
# 55: d 2012-01-01 2 0.3333333
# 56: d 2012-01-01 3 0.3333333
# 57: d 2012-01-01 4 0.3333333
# 58: e 2012-01-01 5 0.1666667
# 59: e 2012-01-01 1 0.1666667
# 60: e 2012-01-01 2 0.1666667
# 61: a 2011-01-01 2 0.3333333
# 62: a 2011-01-01 3 0.3333333
# 63: a 2011-01-01 4 0.3333333
# 64: b 2011-01-01 5 0.3333333
# 65: b 2011-01-01 1 0.3333333
# 66: b 2011-01-01 2 0.3333333
# 67: c 2011-01-01 3 0.0000000
# 68: c 2011-01-01 4 0.0000000
# 69: c 2011-01-01 5 0.0000000
# 70: d 2011-01-01 1 0.3333333
# 71: d 2011-01-01 2 0.3333333
# 72: d 2011-01-01 3 0.3333333
# 73: e 2011-01-01 4 0.0000000
# 74: e 2011-01-01 5 0.0000000
# 75: e 2011-01-01 1 0.0000000
# 76: a 2010-01-01 1 NA
# 77: a 2010-01-01 2 NA
# 78: a 2010-01-01 3 NA
# 79: b 2010-01-01 4 NA
# 80: b 2010-01-01 5 NA
# 81: b 2010-01-01 1 NA
# 82: c 2010-01-01 2 NA
# 83: c 2010-01-01 3 NA
# 84: c 2010-01-01 4 NA
# 85: d 2010-01-01 5 NA
# 86: d 2010-01-01 1 NA
# 87: d 2010-01-01 2 NA
# 88: e 2010-01-01 3 NA
# 89: e 2010-01-01 4 NA
# 90: e 2010-01-01 5 NA
# trainer date finish rate
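One point here is that the success rate is the number of wins out of the number of attempts. To get that, I:
- group by `trainer`, collecting the number of attempts (`seq_len(.N)`) and the number of wins (`cumsum(finish == 1)`);
- group by `date, trainer`, summarizing each group with the ratio of max wins to max attempts, which ensures we have "the end of the last day";
- shift `date` so that we can ultimately ... `merge` (join) back onto the original data, bringing the "last known date" data into today, so that today's races do not influence today's strike rate.
The interim (pre-merge) result can be insightful, showing `prevdate` (the shifted date) instead of replacing `date` in place as above. Note that `prevdate` here is what gets joined on `date` in the original data: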
dt %>%
  .[ order(date), ] %>%
  .[, c("rate", "n") := .(cumsum(finish == 1), seq_len(.N)), by = .(trainer) ] %>%
  # .[, c("rate", "n") := .(cumsum(finish == 1), .I), by = .(trainer) ] %>%
  .[, .(rate = max(rate) / max(n)), by = .(date, trainer) ] %>%
  .[, prevdate := shift(date, type = "lead"), by = .(trainer) ]
# date trainer rate prevdate
# 1: 2010-01-01 a 0.3333333 2011-01-01
# 2: 2010-01-01 b 0.3333333 2011-01-01
# 3: 2010-01-01 c 0.0000000 2011-01-01
# 4: 2010-01-01 d 0.3333333 2011-01-01
# 5: 2010-01-01 e 0.0000000 2011-01-01
# 6: 2011-01-01 a 0.1666667 2012-01-01
# 7: 2011-01-01 b 0.3333333 2012-01-01
# 8: 2011-01-01 c 0.0000000 2012-01-01
# 9: 2011-01-01 d 0.3333333 2012-01-01
# 10: 2011-01-01 e 0.1666667 2012-01-01
# 11: 2012-01-01 a 0.1111111 2013-01-01
# 12: 2012-01-01 b 0.3333333 2013-01-01
# 13: 2012-01-01 c 0.1111111 2013-01-01
# 14: 2012-01-01 d 0.2222222 2013-01-01
# 15: 2012-01-01 e 0.2222222 2013-01-01
# 16: 2013-01-01 a 0.1666667 2014-01-01
# 17: 2013-01-01 b 0.2500000 2014-01-01
# 18: 2013-01-01 c 0.1666667 2014-01-01
# 19: 2013-01-01 d 0.1666667 2014-01-01
# 20: 2013-01-01 e 0.2500000 2014-01-01
# 21: 2014-01-01 a 0.2000000 2015-01-01
# 22: 2014-01-01 b 0.2000000 2015-01-01
# 23: 2014-01-01 c 0.2000000 2015-01-01
# 24: 2014-01-01 d 0.2000000 2015-01-01
# 25: 2014-01-01 e 0.2000000 2015-01-01
# 26: 2015-01-01 a 0.2222222 <NA> ### data this point and below are "lost"
# 27: 2015-01-01 b 0.2222222 <NA> ### when merged, because there are no
# 28: 2015-01-01 c 0.1666667 <NA> ### dates after it to join onto
# 29: 2015-01-01 d 0.2222222 <NA>
# 30: 2015-01-01 e 0.1666667 <NA>
# date trainer rate prevdate
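The answer notes that the pipeline does not need magrittr. As a hedged sketch of what that conversion might look like, the same steps are written below as chained data.table calls. The object names `daily` and `res` and the columns `wins` and `runs` are illustrative choices, and `finish` is recycled explicitly with `rep()` rather than relying on `data.table()` to do it:

```r
library(data.table)

# Same toy data as in the question
n <- 90
dt <- data.table(
  date    = rep(seq(as.Date("2010-01-01"), as.Date("2015-01-01"), by = "year"), n / 6),
  finish  = rep(1:5, length.out = n),
  trainer = sort(rep(letters[1:5], n / 5))
)

# Running wins and attempts per trainer, then one end-of-day row per (date, trainer)
daily <- dt[order(date)][
  , c("wins", "runs") := .(cumsum(finish == 1), seq_len(.N)), by = trainer][
  , .(rate = max(wins) / max(runs)), by = .(date, trainer)]

# Relabel each end-of-day rate with the trainer's next race date
daily[, date := shift(date, type = "lead"), by = trainer]

# Join back: every race row receives the rate known before its own date
res <- merge(dt, daily, by = c("trainer", "date"), all.x = TRUE)
print(res[order(-date)])
```

Rows on each trainer's first date have no earlier record, so their `rate` stays `NA` after the merge.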
Since you essentially need a grouped window function, consider `split.data.table` (not to be confused with `base::split`) to process the date/trainer subsets in a single loop:
setindex(dt, date, trainer)                             # ADD FOR OTHER GROUPS
strike_rates_dt <- split(dt, by=c("date", "trainer"))   # ADD FOR OTHER GROUPS
strike_rates_dt <- lapply(strike_rates_dt, function(sub) {
  t <- sub$trainer[[1]]                                 # ADD FOR OTHER GROUPS
  d <- sub$date[[1]]
  trainer_past_form <- dt[trainer==t & date < d]        # ADD FOR OTHER GROUPS
  sr <- sum(trainer_past_form$finish==1)/nrow(trainer_past_form)
  sub[, strike_rate := sr]                              # SAVE AS NEW COLUMN
})
final_dt <- rbindlist(strike_rates_dt)[order(-date)]
Timings show a noticeable difference compared with the nested `for` loop approach:
Approaches
op_proc <- function() {
  dt <- dt[order(-date)]
  dates = as.character(unique(dt$date))
  for (d in dates) {
    trainers = unique(dt$trainer[dt$date==d])
    for (t in trainers) {
      trainer_past_form = dt[trainer==t & date < d]
      strike_rate = sum(trainer_past_form$finish==1)/nrow(trainer_past_form)
      # save this strike rate for this day and this trainer
      dt$strike_rate[dt$trainer==t & dt$date==d] <- strike_rate
    }
  }
  return(dt)
}
my_proc <- function() {
  strike_rates_dt <- split(dt, by=c("date", "trainer"))
  strike_rates_dt <- lapply(strike_rates_dt, function(sub) {
    t <- sub$trainer[[1]]
    d <- sub$date[[1]]
    trainer_past_form <- dt[trainer==t & date < d]
    sr <- sum(trainer_past_form$finish==1)/nrow(trainer_past_form)
    sub[, strike_rate := sr]
  })
  final_dt <- rbindlist(strike_rates_dt)[order(-date)]
}
n = 90
Timings
# Unit: milliseconds
# expr min lq mean median uq max neval
# op_dt <- op_proc() 57.02562 59.13524 60.13463 59.73631 60.56061 77.34649 100
# Unit: milliseconds
# expr min lq mean median uq max neval
# my_dt <- my_proc() 46.11871 46.67702 48.891 48.67245 49.64088 59.61806 100
n = 900
Timings
# Unit: milliseconds
# expr min lq mean median uq max neval
# op_dt <- op_proc() 58.07979 59.83595 62.24291 60.26232 60.73125 229.4492 100
# Unit: milliseconds
# expr min lq mean median uq max neval
# my_dt <- my_proc() 45.06198 47.09655 48.00078 47.40018 47.93625 53.7639 100
n = 9000
Timings
# Unit: milliseconds
# expr min lq mean median uq max neval
# op_dt <- op_proc() 66.31556 67.07828 68.20643 67.32226 68.23552 82.22218 100
# Unit: milliseconds
# expr min lq mean median uq max neval
# my_dt <- my_proc() 50.05955 51.42313 52.81052 51.73318 54.23603 61.34065 100
n = 90000
Timings
# Unit: milliseconds
# expr min lq mean median uq max neval
# op_dt <- op_proc() 134.3456 137.7812 148.0204 139.4907 142.4315 356.7175 100
# Unit: milliseconds
# expr min lq mean median uq max neval
# my_dt <- my_proc() 87.33779 91.21512 105.1705 92.20642 94.82666 269.798 100
Here are some options.
1) Using a non-equi join:
dt[, strike_rate :=
.SD[.SD, on=.(trainer, date<date), by=.EACHI, sum(finish==1L)/.N]$V1
]
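To see what the non-equi self-join does, here is a small sketch on a made-up three-row table (the table `tiny` and the result name `res` are illustrative only). Each row is matched against all rows of the same trainer with a strictly earlier date, and `by = .EACHI` summarises those matches once per row:

```r
library(data.table)

# Tiny made-up table: one trainer, three race dates
tiny <- data.table(
  trainer = "a",
  date    = as.Date(c("2010-01-01", "2011-01-01", "2012-01-01")),
  finish  = c(1L, 2L, 1L)
)

# For every row of the inner `tiny` (i), match rows of the outer `tiny` (x)
# that share the trainer and have a strictly earlier date, then compute the
# win ratio over the matched rows; `finish` in j refers to the matched x rows
res <- tiny[tiny, on = .(trainer, date < date),
            .(rate = sum(finish == 1L) / .N), by = .EACHI]
print(res$rate)
```

The first row has no earlier races, so its rate is missing. The later rows only see strictly earlier dates, which is what keeps "today" out of the ratio.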
2) Another option that should be faster:
dt[order(trainer, date), strike_rate := {
  ri <- rleid(date)
  firstd <- which(diff(ri) != 0) + 1L
  cs <- replace(rep(NA_real_, .N), firstd, cumsum(finish==1L)[firstd - 1L])
  k <- replace(rep(NA_real_, .N), firstd, as.double(1:.N)[firstd - 1L])
  nafill(cs, "locf") / nafill(k, "locf")
}, trainer]
Output of `setorder(dt, -date, trainer, finish)[]`:
date finish trainer strike_rate
1: 2015-01-01 1 a 0.2000000
2: 2015-01-01 2 a 0.2000000
3: 2015-01-01 3 a 0.2000000
4: 2015-01-01 1 b 0.2000000
5: 2015-01-01 4 b 0.2000000
6: 2015-01-01 5 b 0.2000000
7: 2015-01-01 2 c 0.2000000
8: 2015-01-01 3 c 0.2000000
9: 2015-01-01 4 c 0.2000000
10: 2015-01-01 1 d 0.2000000
11: 2015-01-01 2 d 0.2000000
12: 2015-01-01 5 d 0.2000000
13: 2015-01-01 3 e 0.2000000
14: 2015-01-01 4 e 0.2000000
15: 2015-01-01 5 e 0.2000000
16: 2014-01-01 1 a 0.1666667
17: 2014-01-01 2 a 0.1666667
18: 2014-01-01 5 a 0.1666667
19: 2014-01-01 3 b 0.2500000
20: 2014-01-01 4 b 0.2500000
21: 2014-01-01 5 b 0.2500000
22: 2014-01-01 1 c 0.1666667
23: 2014-01-01 2 c 0.1666667
24: 2014-01-01 3 c 0.1666667
25: 2014-01-01 1 d 0.1666667
26: 2014-01-01 4 d 0.1666667
27: 2014-01-01 5 d 0.1666667
28: 2014-01-01 2 e 0.2500000
29: 2014-01-01 3 e 0.2500000
30: 2014-01-01 4 e 0.2500000
31: 2013-01-01 1 a 0.1111111
32: 2013-01-01 4 a 0.1111111
33: 2013-01-01 5 a 0.1111111
34: 2013-01-01 2 b 0.3333333
35: 2013-01-01 3 b 0.3333333
36: 2013-01-01 4 b 0.3333333
37: 2013-01-01 1 c 0.1111111
38: 2013-01-01 2 c 0.1111111
39: 2013-01-01 5 c 0.1111111
40: 2013-01-01 3 d 0.2222222
41: 2013-01-01 4 d 0.2222222
42: 2013-01-01 5 d 0.2222222
43: 2013-01-01 1 e 0.2222222
44: 2013-01-01 2 e 0.2222222
45: 2013-01-01 3 e 0.2222222
46: 2012-01-01 3 a 0.1666667
47: 2012-01-01 4 a 0.1666667
48: 2012-01-01 5 a 0.1666667
49: 2012-01-01 1 b 0.3333333
50: 2012-01-01 2 b 0.3333333
51: 2012-01-01 3 b 0.3333333
52: 2012-01-01 1 c 0.0000000
53: 2012-01-01 4 c 0.0000000
54: 2012-01-01 5 c 0.0000000
55: 2012-01-01 2 d 0.3333333
56: 2012-01-01 3 d 0.3333333
57: 2012-01-01 4 d 0.3333333
58: 2012-01-01 1 e 0.1666667
59: 2012-01-01 2 e 0.1666667
60: 2012-01-01 5 e 0.1666667
61: 2011-01-01 2 a 0.3333333
62: 2011-01-01 3 a 0.3333333
63: 2011-01-01 4 a 0.3333333
64: 2011-01-01 1 b 0.3333333
65: 2011-01-01 2 b 0.3333333
66: 2011-01-01 5 b 0.3333333
67: 2011-01-01 3 c 0.0000000
68: 2011-01-01 4 c 0.0000000
69: 2011-01-01 5 c 0.0000000
70: 2011-01-01 1 d 0.3333333
71: 2011-01-01 2 d 0.3333333
72: 2011-01-01 3 d 0.3333333
73: 2011-01-01 1 e 0.0000000
74: 2011-01-01 4 e 0.0000000
75: 2011-01-01 5 e 0.0000000
76: 2010-01-01 1 a NA
77: 2010-01-01 2 a NA
78: 2010-01-01 3 a NA
79: 2010-01-01 1 b NA
80: 2010-01-01 4 b NA
81: 2010-01-01 5 b NA
82: 2010-01-01 2 c NA
83: 2010-01-01 3 c NA
84: 2010-01-01 4 c NA
85: 2010-01-01 1 d NA
86: 2010-01-01 2 d NA
87: 2010-01-01 5 d NA
88: 2010-01-01 3 e NA
89: 2010-01-01 4 e NA
90: 2010-01-01 5 e NA
date finish trainer strike_rate
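The mechanics of option 2 are easier to follow on a single trainer. The sketch below uses made-up vectors (not the question's data) and walks through the same steps: find the first row of each new date, store the cumulative wins and run count from the row just before it, and carry those values forward with `nafill(..., "locf")`:

```r
library(data.table)

# Made-up races for one trainer, already sorted by date
date   <- as.Date(c("2010-01-01", "2010-01-01", "2011-01-01", "2011-01-01", "2012-01-01"))
finish <- c(1L, 2L, 1L, 3L, 1L)
n      <- length(date)

ri     <- rleid(date)                    # run id per block of equal dates
firstd <- which(diff(ri) != 0) + 1L      # first row of each new date

# Cumulative wins and run count as of the last row of the previous date,
# placed on the first row of each new date; everything else stays NA
cs <- replace(rep(NA_real_, n), firstd, cumsum(finish == 1L)[firstd - 1L])
k  <- replace(rep(NA_real_, n), firstd, as.double(seq_len(n))[firstd - 1L])

# Carry the stored values forward over the remaining rows of each date
rate <- nafill(cs, "locf") / nafill(k, "locf")
print(rate)
```

Rows on the first date have nothing to carry forward, so they stay `NA`, matching the `NA` rows for 2010-01-01 in the output above.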
3) If the OP can accept the second approach, here is a version that brings `by=trainer` into `j` :)
dt[order(trainer, date), strike_rate := {
  ri <- rleid(date)
  firstd <- which(diff(ri) != 0) + 1L
  cs <- cumsum(finish==1L)
  cumfinishes <- replace(rep(NA_real_, .N), firstd, cs[firstd - 1L])
  k <- replace(rep(NA_real_, .N), firstd, rowid(trainer)[firstd - 1L])
  newt <- which(trainer != shift(trainer))
  prevTrainer <- replace(rep(NA_real_, .N), newt, cs[newt - 1L])
  finishes <- cumfinishes - nafill(replace(prevTrainer, 1L, 0), "locf")
  finishes <- replace(finishes, newt, NaN)
  nafill(finishes, "locf") / nafill(k, "locf")
}]
4) The same idea using `Rcpp`, which should be the fastest and is also more readable:
library(Rcpp)
cppFunction("
NumericVector strike(IntegerVector date, IntegerVector finish, IntegerVector trainer) {
    int i, sz = date.size();
    double cumstrikes = 0, prevcs = NA_REAL, days = 1, prevdays = 1;
    NumericVector strikes(sz), ndays(sz);
    for (i = 0; i < sz; i++) {
        strikes[i] = NA_REAL;
    }
    if (finish[0] == 1)
        cumstrikes = 1;
    for (i = 1; i < sz; i++) {
        if (trainer[i-1] != trainer[i]) {
            cumstrikes = 0;
            days = 0;
        } else if (date[i-1] != date[i]) {
            strikes[i] = cumstrikes;
            ndays[i] = days;
        } else {
            strikes[i] = strikes[i-1];
            ndays[i] = ndays[i-1];
        }
        if (finish[i] == 1) {
            cumstrikes++;
        }
        days++;
    }
    for (i = 0; i < sz; i++) {
        strikes[i] /= ndays[i];
    }
    return strikes;
}")
dt[order(trainer, date), strike_rate := strike(date, finish, rleid(trainer))]
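For readers without a C++ toolchain, the same single-pass idea can be sketched in plain R. This is not the Rcpp function above, just a slower illustration under a hypothetical name, `strike_r`, and it assumes the input is already sorted by trainer and then by date:

```r
# Hypothetical plain-R illustration of the single-pass idea.
# Inputs must be sorted by trainer, then date.
strike_r <- function(date, finish, trainer) {
  n <- length(date)
  out <- rep(NA_real_, n)
  wins <- 0; runs <- 0                    # totals up to and including earlier rows
  prev_wins <- NA_real_                   # totals as of the end of the previous date
  prev_runs <- NA_real_
  for (i in seq_len(n)) {
    if (i == 1L || trainer[i] != trainer[i - 1L]) {
      # new trainer: start again with no history
      wins <- 0; runs <- 0
      prev_wins <- NA_real_; prev_runs <- NA_real_
    } else if (date[i] != date[i - 1L]) {
      # new date for the same trainer: freeze the totals seen so far
      prev_wins <- wins; prev_runs <- runs
    }
    out[i] <- prev_wins / prev_runs
    wins <- wins + (finish[i] == 1)
    runs <- runs + 1
  }
  out
}

# Small made-up input: trainer "a" races twice on day 1 and once on day 2,
# trainer "b" races once on each day
rates <- strike_r(
  date    = c(1, 1, 2, 1, 2),
  finish  = c(1, 2, 3, 1, 1),
  trainer = c("a", "a", "a", "b", "b")
)
print(rates)
```

Because the frozen totals only change when the date changes, all rows of one trainer on one day share the same rate, and rows on a trainer's first day stay `NA`.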