Fastest way for doing 21 day rolling sum for an ActivityType
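I have a large data frame (more than 3 million rows). I am trying to count the number of times an ActivityType occurred within a 21-day window. I modeled my solution on "Rolling Sum by Another Variable in R", but it takes a long time for even a single ActivityType. I don't think 3M+ rows should take an excessive amount of time. Here is what I tried: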
dt <- read.table(text='
Name ActivityType ActivityDate
John Email 1/1/2014
John Email 1/3/2014
John Webinar 1/5/2014
John Webinar 1/20/2014
John Webinar 3/25/2014
John Email 4/1/2014
John Email 4/20/2014
Tom Email 1/1/2014
Tom Webinar 1/5/2014
Tom Webinar 1/20/2014
Tom Webinar 3/25/2014
Tom Email 4/1/2014
Tom Email 4/20/2014
', header=T, row.names = NULL)
library(data.table)
library(reshape2)
dt$ActivityType <- factor(dt$ActivityType)
dt$ActivityDate <- as.Date(dt$ActivityDate, "%m/%d/%Y")
dt <- dt[order(dt$Name, dt$ActivityDate),]
dt <- dcast(dt, Name + ActivityDate ~ ActivityType, fun.aggregate=length)
setDT(dt)
#Build reference table
Ref <- dt[,list(Compare_Value=list(I(Email)),Compare_Date=list(I(ActivityDate))), by=c("Name")]
#Use mapply to get last 21 days of value by Name
dt[,Email_RollingSum := mapply(ActivityDate=ActivityDate,Name=Name, function(ActivityDate, Name) {
d <- as.numeric(Ref$Compare_Date[[Name]] - ActivityDate)
sum((d <= 0 & d >= -21)*Ref$Compare_Value[[Name]])})]
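This is only for ActivityType=Email; I then have to do the same for the other ActivityType levels. The link I got the solution from mentioned using "mcapply" instead of "mapply". Please let me know how to use mcapply, or any other solution that would make this faster.
Below is the expected output. For each row, I take the ActivityDate and the 21 days before it; that span is my time window. I count every occurrence of ActivityType="Email" within that window.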
Name ActivityType ActivityDate Email_RollingSum
John Email 1/1/2014 1
John Email 1/3/2014 2
John Webinar 1/5/2014 2
John Webinar 1/20/2014 2
John Webinar 3/25/2014 0
John Email 4/1/2014 1
John Email 4/20/2014 2
Tom Email 1/1/2014 1
Tom Webinar 1/5/2014 1
Tom Webinar 1/20/2014 1
Tom Webinar 3/25/2014 0
Tom Email 4/1/2014 1
Tom Email 4/20/2014 2
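Try an approach where a data table serves both as the list of names and dates and as the source of the email counts. In data.table this is done by passing a DT as the i argument of DT together with by = .EACHI. The code could look like this: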
library(data.table)
# convert character dates to Date types
dt$ActivityDate <- as.Date(dt$ActivityDate, "%m/%d/%Y")
# convert to a 'data.table' and define key
setDT(dt, key = "Name")
# count emails and webinars
dt <- dt[dt[,.(Name, type = ActivityType, date = ActivityDate)],
.(type, date,
Email = sum(ActivityType == "Email" & between(ActivityDate, date-21, date)),
Webinar = sum(ActivityType == "Webinar" & between(ActivityDate, date-21, date))),
by=.EACHI]
The following uses the same approach as above but includes a few changes that, depending on your data, may improve speed by 30-40%.
setDT(dt, key = "Name")
dt[, ":="(ActivityDate = as.Date(dt$ActivityDate, "%m/%d/%Y"),
ActivityType = as.character(ActivityType) )]
dt4 <- dt[.(Name=Name, type=ActivityType, date=ActivityDate), {z=between(ActivityDate, date-21, date);
.( type, date,
Email=sum( (ActivityType %chin% "Email") & z),
Webinar=sum( (ActivityType %chin% "Webinar") & z) ) }
, by=.EACHI]
setDT(dt)
dt[, ActivityDate := as.Date(ActivityDate, '%m/%d/%Y')]
# add index to keep track of rows
dt[, idx := .I]
# match the dates we're looking for using a rolling join and extract the row numbers
rr = dt[.(Name = Name, ActivityDate = ActivityDate - 21, refIdx = idx),
.(idx, refIdx), on = c('Name', 'ActivityDate'), roll = -Inf]
# idx refIdx
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 1 4
# 5: 5 5
# 6: 5 6
# 7: 6 7
# 8: 8 8
# 9: 8 9
#10: 8 10
#11: 11 11
#12: 11 12
#13: 12 13
# extract the above rows and count occurrences using dcast
dcast(rr[, {seq = idx:refIdx; dt[seq]}, by = 1:nrow(rr)], nrow ~ ActivityType)
# nrow Email Webinar
#1 1 1 0
#2 2 2 0
#3 3 2 1
#4 4 2 2
#5 5 0 1
#6 6 1 1
#7 7 2 0
#8 8 1 0
#9 9 1 1
#10 10 1 2
#11 11 0 1
#12 12 1 1
#13 13 2 0
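As a further alternative (not taken from any of the answers above), the window count can be sketched without a join, using base R's findInterval on the sorted dates of each activity type within each Name. This is a minimal sketch that assumes ActivityDate holds whole-day Date values; the helper name count_in_window and the <type>_RollingSum column names are made up for illustration. For whole days, the number of events in [date - 21, date] equals the number of events on or before date minus the number on or before date - 22.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
dt <- data.table(
  Name = rep(c("John", "Tom"), times = c(7, 6)),
  ActivityType = c("Email", "Email", "Webinar", "Webinar", "Webinar", "Email", "Email",
                   "Email", "Webinar", "Webinar", "Webinar", "Email", "Email"),
  ActivityDate = as.Date(c("1/1/2014", "1/3/2014", "1/5/2014", "1/20/2014",
                           "3/25/2014", "4/1/2014", "4/20/2014",
                           "1/1/2014", "1/5/2014", "1/20/2014",
                           "3/25/2014", "4/1/2014", "4/20/2014"), "%m/%d/%Y")
)

# Hypothetical helper: for each date, count the flagged events that fall
# in [date - window, date]. Assumes whole-day dates.
count_in_window <- function(dates, is_type, window = 21) {
  d <- as.numeric(dates)      # days since the epoch
  e <- sort(d[is_type])       # sorted dates of the events being counted
  # findInterval(x, e) is the number of elements of e that are <= x
  findInterval(d, e) - findInterval(d - window - 1, e)
}

# One new column per activity type, computed by reference within each Name
for (act in unique(dt$ActivityType)) {
  col <- paste0(act, "_RollingSum")
  dt[, (col) := count_in_window(ActivityDate, ActivityType == act), by = Name]
}

print(dt)
```

On the sample data the Email_RollingSum column should match the expected output in the question. Because findInterval does a binary search on the sorted dates, the work per group grows roughly as n log n rather than comparing every row with every other row of the same Name; whether that beats the join-based answers on 3M+ rows would need to be measured on the real data.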