(Presto) SQL: Group by on columns "A" and "B" and count column "C", but also include count of "C" grouped only by "A"

The title feels a bit odd; if you can come up with a better one, feel free to edit it.

Hi,

Imagine this situation: there is a "sales" table with 3 columns (date, store, sale_price), and each row represents the sale of one item:


date           |  store  |  sale_price
---------------+---------+------------
2021-09-01     |   foo   |    15
2021-09-01     |   foo   |    10
2021-09-01     |   foo   |    10
2021-09-01     |   bar   |     5
2021-09-02     |   foo   |    30
2021-09-02     |   bar   |    40
2021-09-02     |   bar   |    20
etc...

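What I want to do is write a query that groups by date and store and counts the number of items sold per store per day (so, ignoring the price). So far that is very simple, but for visualization purposes I am also trying to add an extra row per day that holds the total sales count for that day.

This is the final result I am looking for: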


date           |    store    |  sales_count
---------------+-------------+------------
2021-09-01     |     foo     |     3
2021-09-01     |     bar     |     1
2021-09-01     |  aggregate  |     4
2021-09-02     |     foo     |     1
2021-09-02     |     bar     |     2
2021-09-02     |  aggregate  |     3
etc...

I know I can build this with UNION ALL, but it is not very efficient because it scans the original table twice:

SELECT date,
       store,
       count(sale_price) AS sales_count
  FROM sales
 GROUP BY 1, 2

 UNION ALL

SELECT date,
       'aggregate' AS store,
       count(sale_price) AS sales_count
  FROM sales
 GROUP BY 1

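I also know that I can use an over() clause to create an extra column and avoid scanning "sales" twice, but then I end up with two separate columns instead of the single one I am looking for: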

SELECT date,
       store,
       count(sale_price) AS sales_count,
       sum(count(sale_price)) over(PARTITION BY date) AS sales_per_day
  FROM sales
 GROUP BY 1, 2

--->


date           |    store    |  sales_count |  sales_per_day
---------------+-------------+--------------+-----------------
2021-09-01     |     foo     |      3       |        4
2021-09-01     |     bar     |      1       |        4
2021-09-02     |     foo     |      1       |        3
2021-09-02     |     bar     |      2       |        3
etc...

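Is it possible to achieve what I want without scanning the table twice? Can the last two columns (sales_count and sales_per_day) be merged somehow? Thanks in advance for your help.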

You can use GROUPING SETS, CUBE and ROLLUP to aggregate at different levels within the same query. You can also use the GROUPING operation to determine which columns were included in the grouping for a given output row:

WITH data(day, store, sale_price) AS (
    VALUES
        (DATE '2021-09-01', 'foo', 15),
        (DATE '2021-09-01', 'foo', 10),
        (DATE '2021-09-01', 'foo', 10),
        (DATE '2021-09-01', 'bar',  5),
        (DATE '2021-09-02', 'foo', 30),
        (DATE '2021-09-02', 'bar', 40),
        (DATE '2021-09-02', 'bar', 20)
)
SELECT day,
    if(grouping(store) = 1, '<aggregate>', store),
    count(sale_price) as sales_count
FROM data
GROUP BY GROUPING SETS ((day), (day, store))
ORDER BY day, grouping(store)
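
Since the answer also mentions ROLLUP, here is a sketch of how the same two grouping levels might be written with it. It assumes Presto accepts a plain column combined with ROLLUP in the same GROUP BY, which should expand to the grouping sets (day, store) and (day). The alias store_label is a name made up for this example, and the literal 'aggregate' is taken from the desired output in the question.

```sql
-- Sketch only: same sample data as the query above, written with ROLLUP.
WITH data(day, store, sale_price) AS (
    VALUES
        (DATE '2021-09-01', 'foo', 15),
        (DATE '2021-09-01', 'foo', 10),
        (DATE '2021-09-01', 'foo', 10),
        (DATE '2021-09-01', 'bar',  5),
        (DATE '2021-09-02', 'foo', 30),
        (DATE '2021-09-02', 'bar', 40),
        (DATE '2021-09-02', 'bar', 20)
)
SELECT day,
       -- grouping(store) is 1 on rows where store was rolled up (the per-day total)
       if(grouping(store) = 1, 'aggregate', store) AS store_label,  -- hypothetical alias
       count(sale_price) AS sales_count
FROM data
-- day is always grouped; ROLLUP (store) adds the level without store
GROUP BY day, ROLLUP (store)
-- sort the per-day total after the per-store rows of the same day
ORDER BY day, grouping(store)
```

The output column is aliased as store_label rather than store so that grouping(store) in the ORDER BY still clearly refers to the source column. The order of the per-store rows within a day is not fixed by this ORDER BY.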