How to Aggregate Dates by Difference and Average Distance in Days?

I have a database of cash register transactions. The records are split by the products in each basket:

     Date    Hour  Cust  Prod Basket Spend
1| 20160416    8    C1    P1    B2     10
2| 20160416    8    C1    P2    B2     20
3| 20160115   15    C1    P3    B1     30
4| 20160115   15    C1    P2    B1     50
5| 20161023   11    C1    P4    B3     60

I would like to see:

DaysSinceLastVisit  Cust Basket Spend
      NULL           C1    B1     80
        92           C1    B2     30
       190           C1    B3     60

AvgDaysBetweenVisits Cust AvgSpent
          141         C1    56.67

I don't know how to apply an aggregate function to dates during a GROUP BY. All the other posts on SO seem to have two start/end dates [1] [2] [3].

Here is what I have tried so far:

SELECT SUM(DATE(Date)), Cust, Basket, SUM(Spend) FROM 'a' GROUP BY CUST_CODE,BASKET # Sums the numeric values
SELECT DIFF(DATE(Date)), Cust, Basket, AVG(Spend) FROM 'a' GROUP BY CUST_CODE,BASKET # DIFF/DIFFERENCE not a function

Also, it should be noted that I am running this in R with sqldf, which uses SQLite syntax. However, I would prefer a SQLite solution.

Try this:

df <- data.frame("Date"=c("20160416","20160416","20160115","20160115","20161023"),
             "Hour"=c(8,8,15,15,11), "Cust"=c("C1","C1","C1","C1","C1"),
             "Prod"=c("P1","P2","P3","P2","P4"), "Basket"=c("B2","B2","B1","B1","B3"),
             "Spend"=c(10,20,30,50,60))

df$Date <- as.Date(df$Date, format = "%Y%m%d")

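# Aggregate spend per basket first, then order the visits by date
df2 <- aggregate(Spend ~ Date + Cust + Basket, data = df, FUN = sum)
df2 <- df2[order(df2$Cust, df2$Date), ]

# Days since last visit: NA for the first visit (assumes a single customer)
df2$Days <- c(NA, as.numeric(diff(df2$Date)))

# Finally, average per customer; na.pass keeps the first row so that
# only its missing Days value is skipped, not its Spend
df3 <- aggregate(cbind(Days, Spend) ~ Cust, data = df2, FUN = mean,
                 na.rm = TRUE, na.action = na.pass)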

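Here days_since_last_visit is measured relative to today's date and time, which is more practical. However, if you take the difference between the 1st and 2nd visits and between the 2nd and 3rd, you get 92 and 190, which matches your data. The best way to handle that part is with a cursor; it can also be done in a query, but that is a bit more complicated.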

   select round(julianday('now') - min(julianday(substr(date,1,4) || '-' || substr(date,5,2) || '-' || substr(date,7))), 2) days_since_last_visit,
          date, cust, basket, sum(spend) total_spend
     from customer
 group by cust, basket, date

Average number of days between each record's visit date and today's date:

   select round(avg(julian_days), 2) average_days, cust, round(avg(total_spend), 2) average_spent
     from
          ( select julianday('now') - min(julianday(substr(date,1,4) || '-' || substr(date,5,2) || '-' || substr(date,7))) julian_days, date,
                   cust, basket, sum(spend) total_spend
              from customer
          group by cust, basket, date )
 group by cust

Create and insert scripts, for reference only:

 create table customer ( date text, hour integer, cust text, prod text, basket text, spend integer );

 insert into customer ( date, hour, cust, prod, basket, spend ) values ( '20161023', 11, 'C1', 'P4', 'B3', 60 );
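The "can also be done in a query" remark above can be sketched roughly as follows. This is only one possible approach, not necessarily what the answer had in mind: a correlated subquery looks up each customer's most recent earlier visit. Python's built-in sqlite3 module is used here purely as a way to run the SQLite query; the CTE name visits and the column iso_date are made up for illustration.

```python
import sqlite3

# In-memory copy of the question's data; the table layout follows the
# create script above.
rows = [
    ("20160416", 8, "C1", "P1", "B2", 10),
    ("20160416", 8, "C1", "P2", "B2", 20),
    ("20160115", 15, "C1", "P3", "B1", 30),
    ("20160115", 15, "C1", "P2", "B1", 50),
    ("20161023", 11, "C1", "P4", "B3", 60),
]
con = sqlite3.connect(":memory:")
con.execute("create table customer (date text, hour integer, cust text, "
            "prod text, basket text, spend integer)")
con.executemany("insert into customer values (?, ?, ?, ?, ?, ?)", rows)

query = """
with visits as (
    -- one row per basket; the date is rewritten as YYYY-MM-DD for julianday()
    -- (visits and iso_date are illustrative names, not from the thread)
    select cust, basket,
           substr(date, 1, 4) || '-' || substr(date, 5, 2) || '-' || substr(date, 7, 2) as iso_date,
           sum(spend) as total_spend
      from customer
     group by cust, basket, date
)
select julianday(v.iso_date)
         - (select max(julianday(p.iso_date))      -- latest earlier visit, NULL if none
              from visits p
             where p.cust = v.cust and p.iso_date < v.iso_date) as days_since_last_visit,
       v.cust, v.basket, v.total_spend
  from visits v
 order by v.cust, v.iso_date
"""
result = con.execute(query).fetchall()
for row in result:
    print(row)
```

The first visit has no earlier row, so the subquery returns NULL and the difference is NULL as well, which is what the question asks for.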

This uses SQLite via sqldf, as in the question.

We first define three tables in the WITH clause (they exist only for the duration of the SQL statement):

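  1. aa is table a with an extra Julian date column suitable for differencing
  2. tab_days is a table that uses aa to define the days of difference via an appropriately aggregated join
  3. tab_sum_spend is a table containing the sum of Spend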

Finally, we join the last two and sort appropriately.

library(sqldf) 
# see note at end for a in reproducible form

t1 <- sqldf("
WITH aa AS (SELECT julianday(substr(Date, 1, 4) || '-' ||
                             substr(Date, 5, 2) || '-' ||
                             substr(Date, 7, 2)) juldate, 
                   * 
            FROM a),     
     tab_days AS (SELECT a1.Date, min(a1.juldate - a2.juldate) Days, a1.Cust, a1.Basket
                  FROM   aa a1
                          LEFT JOIN aa a2 ON a1.Date > a2.Date AND a1.Cust = a2.Cust
                  GROUP  BY a1.Cust, a1.Date, a1.Basket),
     tab_sum_spend AS (SELECT Cust, Date, Basket, sum(Spend) Spend
                       FROM   aa
                       GROUP  BY Cust, Date, Basket) 
SELECT Days, Cust, Basket, Spend
FROM tab_days
JOIN tab_sum_spend USING(Cust, Date, Basket)
ORDER  BY Cust, Date, Basket
")
t1

##    Days Cust Basket Spend
## 1  <NA>   C1     B1    80
## 2  92.0   C1     B2    30
## 3 190.0   C1     B3    60

For the second question:

sqldf("SELECT avg(Days) AvgDays, Cust, avg(Spend) AvgSpend FROM t1 GROUP BY Cust")
##   AvgDays Cust AvgSpend
## 1     141   C1 56.66667
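As a possible alternative to the self-join, SQLite 3.25 and later provide window functions, so lag() can fetch each customer's previous visit directly. The following is a minimal sketch under that version assumption, not the method used above. Python's built-in sqlite3 module is used only to run the SQL, and the CTE name baskets is chosen for illustration.

```python
import sqlite3

# The question's data, loaded into an in-memory table named a.
rows = [
    ("20160416", 8, "C1", "P1", "B2", 10),
    ("20160416", 8, "C1", "P2", "B2", 20),
    ("20160115", 15, "C1", "P3", "B1", 30),
    ("20160115", 15, "C1", "P2", "B1", 50),
    ("20161023", 11, "C1", "P4", "B3", 60),
]
con = sqlite3.connect(":memory:")
con.execute("create table a (Date text, Hour integer, Cust text, "
            "Prod text, Basket text, Spend integer)")
con.executemany("insert into a values (?, ?, ?, ?, ?, ?)", rows)

# Shared CTEs: baskets has one row per basket, t1 adds the gap to the
# previous visit (requires window functions, i.e. SQLite >= 3.25).
ctes = """
with baskets as (
    select Cust, Basket,
           julianday(substr(Date, 1, 4) || '-' || substr(Date, 5, 2) || '-' ||
                     substr(Date, 7, 2)) as juldate,
           sum(Spend) as Spend
      from a
     group by Cust, Basket, Date
),
t1 as (
    select juldate - lag(juldate) over (partition by Cust order by juldate) as Days,
           Cust, Basket, Spend, juldate
      from baskets
)
"""

# First question: days since the previous visit, per basket.
per_basket = con.execute(
    ctes + "select Days, Cust, Basket, Spend from t1 order by Cust, juldate"
).fetchall()

# Second question: avg() skips the NULL gap of each customer's first visit.
averages = con.execute(
    ctes + "select avg(Days), Cust, round(avg(Spend), 2) from t1 group by Cust"
).fetchall()

print(per_basket)
print(averages)
```

Partitioning by Cust keeps the gaps from crossing customer boundaries if the table holds more than one customer.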

Note: The data.frame a in reproducible form is:

Lines <- "Date Hour Cust Prod Basket Spend
1 20160416    8   C1   P1     B2    10
2 20160416    8   C1   P2     B2    20
3 20160115   15   C1   P3     B1    30
4 20160115   15   C1   P2     B1    50
5 20161023   11   C1   P4     B3    60"
a <- read.table(text = Lines, as.is = TRUE)