Need help optimizing cumsum like code - sqldf, data.table, non-equi joins
Looking for help to optimize my sqldf code, which generates aggregated historical statistics based on non-equi joins, i.e. the data should only be aggregated up to (but not including) the current data row.
It is important that any solution scales to many different groupings, e.g. aggregates filtered by tourney_name as in the sqldf example.
Get the data:
library(dplyr); library(sqldf); data_list <- list()
for(i in 2000:2018){
data_list[[i]] <-
readr::read_csv(paste0('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_',i,'.csv')) %>%
as.data.frame}
data <- data.table::rbindlist(data_list)
data <- select(data, tourney_name, tourney_date, match_num, winner_id, winner_name, loser_id, loser_name)
system.time(
data2 <- sqldf("select a.*,
count(b.winner_id) as winner_overall_wins
from data a
left join data b
on (a.winner_id = b.winner_id and a.tourney_date > b.tourney_date)
group by a.tourney_name, a.tourney_date, a.match_num, a.winner_id
order by tourney_date desc, tourney_name, match_num desc",
stringsAsFactors = FALSE)
) # takes 16 sec, would like to look for a vectorized solution
head(data2)
What I have tried to speed up the code:
- For loops: too slow.
- Full dplyr join/filter: ran out of memory at over 60 GB.
- data.table/cumsum: could not get the code to work. I would prefer a non-data.table approach but am willing to learn a general solution.
Thanks!
The OP has asked to optimize a number of cascading sqldf statements (before the OP's edit of the question). Unfortunately, the OP did not explain verbally which aggregations he had implemented, so a fair amount of reverse engineering was required.
Anyway, here is what I would do using data.table to achieve the same results. The execution time of the OP's sqldf code is reduced from 16 seconds to well below 0.2 seconds for the data.table versions.
data.table version of the edited example
The OP has edited the question to reduce the number of sqldf statements. Now, only one aggregate is computed.
The new column winner_overall_wins in data2 is the count of all matches the winner has won before the start of the actual tourney. This number is attached to all matches of the actual tourney which the winner has won. (Note that this is different from the count of matches won before the actual match.)
Since version 1.9.8 (on CRAN 25 Nov 2016), data.table is capable of performing non-equi joins. In addition, fread() can be advised to read only selected columns, which speeds up I/O even further.
library(data.table) # v1.11.2
urls <- sprintf(
"https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_%i.csv",
2000:2018)
selected_cols <- c("tourney_name", "tourney_date", "match_num",
"winner_id", "winner_name",
"loser_id", "loser_name")
# read only selected columns from files & combine into one data object
matches <- rbindlist(lapply(urls, fread, select = selected_cols))
# non-equi join to compute aggregate, second join to append, order result
system.time({
result_nej <- matches[
unique(matches[matches, on = .(winner_id, tourney_date < tourney_date),
.(winner_overall_wins = .N), by = .EACHI]),
on = .(winner_id, tourney_date)][
order(-tourney_date, tourney_name, -match_num)]
})
The two data.table joins and the subsequent ordering took about 0.15 seconds on my system, versus 16 to 19 seconds for various runs of the OP's sqldf code.
The history of a particular player can be retrieved with
p_name <- "Federer"; result_nej[winner_name %like% p_name | loser_name %like% p_name]
tourney_name tourney_date match_num winner_id winner_name loser_id loser_name winner_overall_wins
1: Australian Open 20180115 701 103819 Roger Federer 105227 Marin Cilic 1128
2: Australian Open 20180115 602 103819 Roger Federer 111202 Hyeon Chung 1128
3: Australian Open 20180115 504 103819 Roger Federer 104607 Tomas Berdych 1128
4: Australian Open 20180115 408 103819 Roger Federer 105916 Marton Fucsovics 1128
5: Australian Open 20180115 316 103819 Roger Federer 104755 Richard Gasquet 1128
---
1131: Marseille 20000207 3 103819 Roger Federer 102179 Antony Dupuis 4
1132: Davis Cup WG R1: SUI vs AUS 20000204 2 103819 Roger Federer 102882 Mark Philippoussis 3
1133: Australian Open 20000117 90 103819 Roger Federer 102466 Jan Kroslak 1
1134: Australian Open 20000117 52 103819 Roger Federer 102021 Michael Chang 1
1135: Adelaide 20000103 2 103819 Roger Federer 102533 Jens Knippschild 0
There is an alternative and even faster solution using cumsum() and shift():
system.time({
# cumulative operations require ordered data
setorder(matches, tourney_date, tourney_name, match_num)
# add tourney id for convenience and conciseness
matches[, t_id := rleid(tourney_date, tourney_name)]
# aggregate by player and tourney
p_t_hist <- matches[, .(winner_won = .N), by = .(winner_id, t_id)]
# compute cumulative sum for each player and
# lag to show only matches of previous tourneys
tmp <- p_t_hist[order(t_id),
.(t_id, winner_overall_wins = shift(cumsum(winner_won))),
by = winner_id]
# append new column & order result
result_css <- matches[tmp, on = .(t_id, winner_id)][order(-t_id)]
})
p_name <- "Federer"; result_css[winner_name %like% p_name | loser_name %like% p_name]
On my system, the elapsed time is about 0.05 seconds, i.e. 3 times faster than the non-equi join variant and many times faster than the OP's approach.
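As a sanity check of the logic (sketched here in plain Python on synthetic rows, not a test of the R code), the brute-force non-equi-join count and the lagged-cumulative-sum count should agree, because all matches of a tourney share one tourney_date:

```python
from collections import defaultdict

# each row is one won match: (winner_id, tourney_date, match_num) -- synthetic data
rows = [
    (1, 20000103, 1), (1, 20000103, 2),
    (2, 20000103, 1),
    (1, 20000207, 1), (2, 20000207, 2), (1, 20000207, 3),
    (2, 20000305, 1), (1, 20000305, 2),
]

# strategy 1: brute-force non-equi "join" -- for each row, count the
# winner's wins on strictly earlier tourney dates (O(n^2), like the sqldf query)
nej = [sum(1 for (w, d, _) in rows if w == wid and d < date)
       for (wid, date, _) in rows]

# strategy 2: one ordered pass with a per-player cumulative sum,
# lagged at tourney level
by_date = defaultdict(list)
for i, (wid, date, _) in enumerate(rows):
    by_date[date].append(i)

wins_so_far = defaultdict(int)
css = [0] * len(rows)
for date in sorted(by_date):
    idx = by_date[date]
    for i in idx:                      # assign pre-tourney counts first
        css[i] = wins_so_far[rows[i][0]]
    for i in idx:                      # then add this tourney's wins
        wins_so_far[rows[i][0]] += 1

print(nej == css)  # True
```

Both strategies produce identical counts on this data; the second only needs a single ordered pass, which is why the cumsum()/shift() variant scales so much better.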