Redshift - SQL - Cumulative average for grouped results

Question

I currently have a table like this:

Date	customer_id	sales
1/1	1	1
1/1	1	1
1/1	1	1
1/1	2	1
1/2	2	3
1/2	2	1
1/2	1	2
1/2	1	1
1/3	1	2
1/3	2	2
1/3	2	3
1/3	2	3

This is eventually aggregated by customer_id to get total_sales, like so:

customer_id	total_sales
1	8
2	13

I then calculate a metric from this table, average_sales, defined as:

sum(total_sales) / count(distinct customer_id)

With the data above, this gives an average_sales of 10.5.

However, I need a way to calculate this average cumulatively for each day, like this:

Date 1/1 would be sum(total sales) for 1/1 / count(distinct customer_ids) for 1/1
Date 1/2 would be sum(total sales) for 1/1-1/2 / count(distinct customer_ids) for 1/1-1/2
Date 1/3 would be sum(total sales) for 1/1-1/3 / count(distinct customer_ids) for 1/1-1/3

The final day (1/3) should equal the overall average metric of 10.5.

The final table should look like this:

Date	average_sales
1/1	2 (4/2)
1/2	5.5 (11/2)
1/3	10.5 (21/2)

So far I've tried a number of approaches with grouping/window functions, but I can't seem to get the right numbers. Any help would be greatly appreciated!

Answer 1

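The main problem is that you can't use COUNT(DISTINCT) as a window function.

There is a roundabout way of calculating it anyway, though.

1. Work out the first date on which each customer id appears
2. Rank the customers by when they first appeared
3. The running MAX(cust_rank) is the number of customers seen to date

This gives (using a table called example with columns date_id, cust_id and sales)...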

WITH
  check_first_date AS
(
  SELECT
    *,
    MIN(date_id) OVER (PARTITION BY cust_id)   AS cust_id_first_date
  FROM
    example
),
  rank_customers_by_time AS
(
  SELECT
    *,
    DENSE_RANK() OVER (ORDER BY cust_id_first_date, cust_id)  AS cust_rank
  FROM
    check_first_date
)
SELECT
  date_id,
  MAX(MAX(cust_rank)) OVER (ORDER BY date_id)   AS customers_to_date,
  SUM(SUM(sales))     OVER (ORDER BY date_id)   AS sales_to_date
FROM
  rank_customers_by_time
GROUP BY
  date_id
ORDER BY
  date_id

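Then you can divide one by the other. If sales is an integer column, cast one side to a decimal or float first, otherwise the division truncates (11 / 2 gives 5, not 5.5).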
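As a cross-check of the logic (this is not Redshift code), the three steps above can be sketched in plain Python. The function name cumulative_average_by_rank and the (month, day) tuples used for dates are assumptions made for this illustration; the rows are the sample data from the question.

```python
from collections import defaultdict

# Sample rows from the question: (date, customer_id, sales).
# Dates are written as (month, day) tuples so they sort chronologically.
rows = [
    ((1, 1), 1, 1), ((1, 1), 1, 1), ((1, 1), 1, 1), ((1, 1), 2, 1),
    ((1, 2), 2, 3), ((1, 2), 2, 1), ((1, 2), 1, 2), ((1, 2), 1, 1),
    ((1, 3), 1, 2), ((1, 3), 2, 2), ((1, 3), 2, 3), ((1, 3), 2, 3),
]


def cumulative_average_by_rank(rows):
    # Step 1: first date on which each customer appears.
    first_date = {}
    for date, cust, _ in rows:
        first_date[cust] = min(date, first_date.get(cust, date))

    # Step 2: rank customers by first date (ties broken by customer id).
    ordered = sorted(first_date, key=lambda c: (first_date[c], c))
    rank = {cust: i for i, cust in enumerate(ordered, start=1)}

    # Per-date aggregates: highest rank seen that day and total sales.
    max_rank = defaultdict(int)
    sales = defaultdict(int)
    for date, cust, amount in rows:
        max_rank[date] = max(max_rank[date], rank[cust])
        sales[date] += amount

    # Step 3: the running max of rank is the distinct customer count to date.
    result = []
    customers_to_date = 0
    sales_to_date = 0
    for date in sorted(sales):
        customers_to_date = max(customers_to_date, max_rank[date])
        sales_to_date += sales[date]
        result.append((date, sales_to_date / customers_to_date))
    return result


averages = cumulative_average_by_rank(rows)
```

For the sample data, averages holds 2.0, 5.5 and 10.5 for 1/1, 1/2 and 1/3, matching the table in the question.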

There are other ways to get a count-distinct over time, such as a correlated sub-query. I suspect (I haven't tested it) that it's even slower.

SELECT
  date_id,
  (
    SELECT COUNT(DISTINCT lookup.cust_id)
      FROM example AS lookup
     WHERE lookup.date_id <= example.date_id
  )
    AS customers_to_date,
  SUM(SUM(sales))     OVER (ORDER BY date_id)   AS sales_to_date
FROM
  example
GROUP BY
  date_id
ORDER BY
  date_id

Here is a demo (using PostgreSQL, as the closest approximation to Redshift), with slightly different data to show that it works even when customer ids appear 'out of order'.

https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=a5a37f3337e42123424c5cf1dbfe0152
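For comparison, the correlated sub-query variant can be mimicked in plain Python: for every date, rescan all rows up to that date and count the distinct customers. This is only an illustrative sketch; the function name cumulative_average_by_lookup and the (month, day) date tuples are assumptions, and the repeated rescan hints at why this form may be the slower one.

```python
# Sample rows from the question: (date, customer_id, sales).
# Dates are written as (month, day) tuples so they sort chronologically.
rows = [
    ((1, 1), 1, 1), ((1, 1), 1, 1), ((1, 1), 1, 1), ((1, 1), 2, 1),
    ((1, 2), 2, 3), ((1, 2), 2, 1), ((1, 2), 1, 2), ((1, 2), 1, 1),
    ((1, 3), 1, 2), ((1, 3), 2, 2), ((1, 3), 2, 3), ((1, 3), 2, 3),
]


def cumulative_average_by_lookup(rows):
    result = []
    for current in sorted({date for date, _, _ in rows}):
        # Equivalent of the sub-query: distinct customers up to this date.
        seen = {cust for date, cust, _ in rows if date <= current}
        # Running total of sales up to and including this date.
        total = sum(amount for date, _, amount in rows if date <= current)
        result.append((current, total / len(seen)))
    return result


lookup_averages = cumulative_average_by_lookup(rows)
```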

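EDIT: A shorter (faster?) version using window functions

For each customer_id, identify its first row (this implicitly requires the rows to have a unique id).

Then keep a running total of the number of first rows that have occurred so far...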

WITH
  check_first_occurrence AS
(
  SELECT
    *,
    MIN(id) OVER (PARTITION BY cust_id)   AS cust_id_first_id
  FROM
    example
)
SELECT
  date_id,
  SUM(SUM(CASE WHEN id = cust_id_first_id THEN 1 ELSE 0 END)) OVER (ORDER BY date_id)   AS customers_to_date,
  SUM(SUM(sales                                            )) OVER (ORDER BY date_id)   AS sales_to_date
FROM
  check_first_occurrence
GROUP BY
  date_id
ORDER BY
  date_id

https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=94e5fb624a89170aaf819e2b3ccd01d6
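The first-row idea can also be traced in plain Python. This sketch assumes each row carries a unique id that increases with the date (the ids 1 to 12 here are made up for the illustration), and the function name cumulative_average_by_first_row is hypothetical.

```python
from collections import defaultdict

# Sample rows from the question with an assumed unique id: (id, date, cust_id, sales).
# Dates are written as (month, day) tuples so they sort chronologically.
rows = [
    (1, (1, 1), 1, 1), (2, (1, 1), 1, 1), (3, (1, 1), 1, 1), (4, (1, 1), 2, 1),
    (5, (1, 2), 2, 3), (6, (1, 2), 2, 1), (7, (1, 2), 1, 2), (8, (1, 2), 1, 1),
    (9, (1, 3), 1, 2), (10, (1, 3), 2, 2), (11, (1, 3), 2, 3), (12, (1, 3), 2, 3),
]


def cumulative_average_by_first_row(rows):
    # Smallest row id per customer, like MIN(id) OVER (PARTITION BY cust_id).
    first_id = {}
    for row_id, _, cust, _ in rows:
        first_id[cust] = min(row_id, first_id.get(cust, row_id))

    # Per date: how many customers have their first row here, and total sales.
    new_customers = defaultdict(int)
    sales = defaultdict(int)
    for row_id, date, cust, amount in rows:
        if row_id == first_id[cust]:
            new_customers[date] += 1
        sales[date] += amount

    # Running sums over dates give customers to date and sales to date.
    result = []
    customers_to_date = 0
    sales_to_date = 0
    for date in sorted(sales):
        customers_to_date += new_customers[date]
        sales_to_date += sales[date]
        result.append((date, sales_to_date / customers_to_date))
    return result


first_row_averages = cumulative_average_by_first_row(rows)
```

Because each customer contributes exactly one first row, the running sum of first rows equals the distinct customer count to date.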

The shorter version should be significantly friendlier to Redshift's horizontal scaling.

For example, assuming you distribute by customer and sort by date
