How to Aggregate Dates By Difference And Average Distance in Days?
I have a database of cash register transactions. The records are split by product within a basket:
   Date     Hour Cust Prod Basket Spend
1| 20160416    8 C1   P1   B2        10
2| 20160416    8 C1   P2   B2        20
3| 20160115   15 C1   P3   B1        30
4| 20160115   15 C1   P2   B1        50
5| 20161023   11 C1   P4   B3        60
I would like to see:
DaysSinceLastVisit Cust Basket Spend
NULL               C1   B1        80
92                 C1   B2        30
190                C1   B3        60
and
AvgDaysBetweenVisits Cust AvgSpent
141                  C1   56.67
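I don't know how to apply an aggregate function to dates within a GROUP BY. All the other posts on SO seem to have two dates, a start and an end [1] [2] [3].
Here is what I have tried so far: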
SELECT SUM(DATE(Date)), Cust, Basket, SUM(Spend) FROM 'a' GROUP BY CUST_CODE,BASKET # Sums the numeric values
SELECT DIFF(DATE(Date)), Cust, Basket, AVG(Spend) FROM 'a' GROUP BY CUST_CODE,BASKET # DIFF/DIFFERENCE not a function
Also, note that I am running this in R with sqldf, which uses SQLite syntax. I would prefer an SQLite solution, however.
Try this:
df <- data.frame("Date"=c("20160416","20160416","20160115","20160115","20161023"),
"Hour"=c(8,8,15,15,11), "Cust"=c("C1","C1","C1","C1","C1"),
"Prod"=c("P1","P2","P3","P2","P4"), "Basket"=c("B2","B2","B1","B1","B3"),
"Spend"=c(10,20,30,50,60))
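df$Date <- as.Date(df$Date, format = "%Y%m%d")
# Aggregate the data first: one row per visit (date, customer, basket)
df2 <- aggregate(Spend ~ Date + Cust + Basket, data = df, FUN = sum)
# Now get days since the previous visit, per customer. The first visit has no
# predecessor, so it gets NA rather than 0 (a 0 would drag the average down to 94)
df2 <- df2[order(df2$Cust, df2$Date), ]
df2$Days <- ave(as.numeric(df2$Date), df2$Cust, FUN = function(d) c(NA, diff(d)))
# And finally average per customer, skipping the NA gap but keeping every visit's spend
df3 <- aggregate(cbind(Days, Spend) ~ Cust, data = df2, FUN = mean,
                 na.rm = TRUE, na.action = na.pass)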
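days_since_last_visit is more practical when measured relative to today's date and time. However, if you take the difference between the 1st and 2nd visits and between the 2nd and 3rd, you get 92 and 190, which matches your data. The best way to handle that part is with a cursor; it can also be done in a query, but that is somewhat more complicated.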
select round(julianday('now') - min(julianday(substr(date,1,4) || '-' || substr(date,5,2) || '-' || substr(date,7))), 2) days_since_last_visit,
       date, cust, basket, sum(spend) total_spend
from customer
group by cust, basket, date
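SQLite's julianday() accepts ISO-style YYYY-MM-DD strings, which is why the query reshapes the YYYYMMDD values with substr first. The conversion can be checked in isolation with a small sketch; Python's built-in sqlite3 module is used here only as a harness for running SQLite, and the helper name days_between is an assumption made for illustration.

```python
import sqlite3

# SQL expression that reshapes a YYYYMMDD value into YYYY-MM-DD and converts
# it to a Julian day number, mirroring the substr/julianday idiom in the query.
TO_JULIAN = (
    "julianday(substr(?, 1, 4) || '-' || substr(?, 5, 2) || '-' || substr(?, 7, 2))"
)


def days_between(earlier, later):
    """Return the number of days from `earlier` to `later` (both YYYYMMDD text)."""
    conn = sqlite3.connect(":memory:")
    try:
        sql = f"SELECT {TO_JULIAN} - {TO_JULIAN}"
        # Each date is bound three times, once per substr() call.
        (diff,) = conn.execute(sql, (later,) * 3 + (earlier,) * 3).fetchone()
    finally:
        conn.close()
    return round(diff)


gap_1 = days_between("20160115", "20160416")  # B1 -> B2
gap_2 = days_between("20160416", "20161023")  # B2 -> B3
print(gap_1, gap_2)  # prints "92 190"
```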
Average number of days between each visit and today, along with the average spend per visit:
select round(avg(julian_days), 2) average_days, cust, round(avg(total_spend), 2) average_spent
from
  ( select julianday('now') - min(julianday(substr(date,1,4) || '-' || substr(date,5,2) || '-' || substr(date,7))) julian_days,
           date, cust, basket, sum(spend) total_spend
    from customer
    group by cust, basket, date )
group by cust
Create and insert scripts, for reference only:
create table customer ( date text, hour integer, cust text, prod text, basket text, spend integer );
insert into customer ( date, hour, cust, prod, basket, spend ) values ( '20161023', 11, 'C1', 'P4', 'B3', 60 );
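The gaps between consecutive visits (92 and 190 days) can also be computed in a single query rather than with a cursor. The following is a minimal sketch under stated assumptions, not a definitive implementation: it uses a correlated subquery to look up each customer's latest earlier visit, the CTE name visits is illustrative, and Python's built-in sqlite3 module serves only as a harness for running SQLite against the question's five rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (date TEXT, hour INTEGER, cust TEXT, "
    "prod TEXT, basket TEXT, spend INTEGER)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("20160416", 8, "C1", "P1", "B2", 10),
        ("20160416", 8, "C1", "P2", "B2", 20),
        ("20160115", 15, "C1", "P3", "B1", 30),
        ("20160115", 15, "C1", "P2", "B1", 50),
        ("20161023", 11, "C1", "P4", "B3", 60),
    ],
)

VISIT_GAPS_SQL = """
WITH visits AS (            -- one row per visit, with its Julian day number
    SELECT cust, basket, date,
           julianday(substr(date, 1, 4) || '-' ||
                     substr(date, 5, 2) || '-' ||
                     substr(date, 7, 2)) AS jd,
           sum(spend) AS total_spend
    FROM customer
    GROUP BY cust, basket, date
)
SELECT v.jd - (SELECT max(p.jd)          -- latest earlier visit, same customer
               FROM visits p
               WHERE p.cust = v.cust AND p.jd < v.jd) AS days_since_last_visit,
       v.cust, v.basket, v.total_spend
FROM visits v
ORDER BY v.cust, v.jd
"""

visit_gaps = conn.execute(VISIT_GAPS_SQL).fetchall()
for row in visit_gaps:
    print(row)
conn.close()
```

The first visit of each customer has no earlier visit, so the subquery returns NULL and the difference is NULL as well, which matches the NULL in the desired output.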
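This uses SQLite via sqldf, as in the question.
We first define three tables in the WITH clause (they exist only for the duration of the SQL statement):
aa is table a with an additional Julian date column suitable for differencing
tab_days is a table that uses aa to define the days of difference via an appropriately aggregated join
tab_sum_spend is a table with the sum of Spend
Finally, we join the latter two and order the result appropriately.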
library(sqldf)
# see note at end for a in reproducible form
t1 <- sqldf("
WITH aa AS (SELECT julianday(substr(Date, 1, 4) || '-' ||
substr(Date, 5, 2) || '-' ||
substr(Date, 7, 2)) juldate,
*
FROM a),
tab_days AS (SELECT a1.Date, min(a1.juldate - a2.juldate) Days, a1.Cust, a1.Basket
FROM aa a1
LEFT JOIN aa a2 ON a1.Date > a2.Date AND a1.Cust = a2.Cust
GROUP BY a1.Cust, a1.Date, a1.Basket),
tab_sum_spend AS (SELECT Cust, Date, Basket, sum(Spend) Spend
FROM aa
GROUP BY Cust, Date, Basket)
SELECT Days, Cust, Basket, Spend
FROM tab_days
JOIN tab_sum_spend USING(Cust, Date, Basket)
ORDER BY Cust, Date, Basket
")
t1
## Days Cust Basket Spend
## 1 <NA> C1 B1 80
## 2 92.0 C1 B2 30
## 3 190.0 C1 B3 60
For the second question:
sqldf("SELECT avg(Days) AvgDays, Cust, avg(Spend) AvgSpend FROM t1 GROUP BY Cust")
## AvgDays Cust AvgSpend
## 1 141 C1 56.66667
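The two averages have different denominators because SQLite's avg() skips NULL values: the day gap is averaged over the two non-NULL rows, (92 + 190) / 2 = 141, while the spend is averaged over all three visits, (80 + 30 + 60) / 3 ≈ 56.67. A small sketch to confirm this, assuming the per-visit result is loaded into a table named t1 as in the R session; Python's built-in sqlite3 module is only a harness for running SQLite here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (Days REAL, Cust TEXT, Basket TEXT, Spend REAL)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?, ?)",
    [
        (None, "C1", "B1", 80),  # first visit: no previous visit, so Days is NULL
        (92, "C1", "B2", 30),
        (190, "C1", "B3", 60),
    ],
)

# count(Days) counts only non-NULL gaps; count(*) counts every visit.
avg_days, n_gaps, cust, avg_spend, n_visits = conn.execute(
    "SELECT avg(Days), count(Days), Cust, avg(Spend), count(*) "
    "FROM t1 GROUP BY Cust"
).fetchone()
conn.close()

print(avg_days, n_gaps, cust, round(avg_spend, 2), n_visits)
```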
Note: the data.frame a in reproducible form is:
Lines <- "Date Hour Cust Prod Basket Spend
1 20160416 8 C1 P1 B2 10
2 20160416 8 C1 P2 B2 20
3 20160115 15 C1 P3 B1 30
4 20160115 15 C1 P2 B1 50
5 20161023 11 C1 P4 B3 60"
a <- read.table(text = Lines, as.is = TRUE)