Left join lateral for conditional sums

I have a dataset of purchases with customer, product and category.

customer     product     category    sales_value
       A     aerosol     air_care             10
       B     aerosol     air_care             12
       C     aerosol     air_care              7
       A     perfume     air_care              8
       A     perfume     air_care              2
       D     perfume     air_care             11
       C      burger         food             13
       D       fries         food              6
       C       fries         food              9

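I want, for each product, the ratio between the sales value spent on this product and the sales value spent on this product's category, by the customers who bought the product at least once.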

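Said another way: take the customers who bought fries at least once, and for all of them compute A) the sum of sales value spent on fries and B) the sum of sales value spent on food.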

The intermediate table would have the following form:

product    category  sum_spent_on_product           sum_spent_on_category    ratio
                                                 by_people_buying_product
aerosol    air_care                    29                              39     0.74
perfume    air_care                    21                              31     0.68
 burger        food                    13                              22     0.59
  fries        food                    15                              28     0.54

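Example: people who bought aerosol at least once spent a total of 1800 on this product. Overall, the same people spent 3600 on the air_care category (to which aerosol belongs). The ratio for aerosol is therefore 0.5.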
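To make the definition concrete on the sample rows above, here is a small Python sketch that reproduces the fries row by hand. It is only an illustration of the arithmetic, not part of my actual setup, and the variable names are made up for this example.

```python
# Sample rows from the table above: (customer, product, category, sales_value)
purchases = [
    ("A", "aerosol", "air_care", 10),
    ("B", "aerosol", "air_care", 12),
    ("C", "aerosol", "air_care", 7),
    ("A", "perfume", "air_care", 8),
    ("A", "perfume", "air_care", 2),
    ("D", "perfume", "air_care", 11),
    ("C", "burger", "food", 13),
    ("D", "fries", "food", 6),
    ("C", "fries", "food", 9),
]

product, category = "fries", "food"

# Customers who bought the product at least once
buyers = {cust for cust, prod, _, _ in purchases if prod == product}

# A) what was spent on the product itself
spent_on_product = sum(val for _, prod, _, val in purchases if prod == product)

# B) what those same customers spent on the whole category
spent_on_category = sum(
    val for cust, _, cat, val in purchases if cat == category and cust in buyers
)

ratio = spent_on_product / spent_on_category
print(sorted(buyers), spent_on_product, spent_on_category)  # ['C', 'D'] 15 28
```

Here C and D are the fries buyers, they spent 15 on fries and 28 on food overall, so the ratio is 15 / 28.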

I tried to solve this with a left join lateral that computes the intermediate result for each product, but I can't work out how to include the condition "only for customers who bought this specific product":

select
    distinct (product_id)
  , category
  , c.sales_category
from transactions t
left join lateral (
  select
    sum(sales_value) as sales_category
  from transactions
  where category = t.category
  group by category
) c on true
;

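The query above lists, for each product, the sum spent on the product's category, but without the required "bought this product" condition.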

Is left join lateral the right approach? Is there another solution in plain SQL?

I want, for each product, the ratio between the sales value spent on this product, and the sales value spent on this product's category, by the customers who bought the product at least once.

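If I understand correctly, you can aggregate sales by customer and category to get the category total. In Postgres, you can keep an array of the products and use it for matching. So, the query looks like: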

select p.product, p.category,
       sum(p.sales_value) as product_only_sales, 
       sum(pp.sales_value) as comparable_sales
from purchases p join
     (select customer, category, array_agg(distinct product) as products, sum(sales_value) as sales_value
      from purchases p
      group by customer, category
     ) pp
     on p.customer = pp.customer and p.category = pp.category and p.product = any (pp.products)
group by p.product, p.category;

Here is a db<>fiddle.

EDIT:

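The data allows a customer to have several rows for the same product, and that throws the totals off. The solution is to pre-aggregate by product for each customer: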

select p.product, p.category, sum(p.sales_value) as product_only_sales, sum(pp.sales_value) as comparable_sales
from (select customer, category, product, sum(sales_value) as sales_value
      from purchases p
      group by customer, category, product
     ) p join
     (select customer, category, array_agg(distinct product) as products, sum(sales_value) as sales_value
      from purchases p
      group by customer, category
     ) pp
     on p.customer = pp.customer and p.category = pp.category and p.product = any (pp.products)
group by p.product, p.category

Here is a db<>fiddle for this example.
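To see why the duplicates matter, here is a minimal sketch that runs both variants on the sample data. It uses Python's sqlite3 module as a stand-in for Postgres, which is an assumption made only to keep the example self-contained. SQLite has no array_agg, so the products array is left out; rows that already match on customer and category always satisfy that filter on this data, so the double counting is unaffected.

```python
import sqlite3

rows = [
    ("A", "aerosol", "air_care", 10), ("B", "aerosol", "air_care", 12),
    ("C", "aerosol", "air_care", 7), ("A", "perfume", "air_care", 8),
    ("A", "perfume", "air_care", 2), ("D", "perfume", "air_care", 11),
    ("C", "burger", "food", 13), ("D", "fries", "food", 6),
    ("C", "fries", "food", 9),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases "
    "(customer TEXT, product TEXT, category TEXT, sales_value INTEGER)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", rows)

# Per-customer, per-category totals (the pp subquery, minus the array)
totals = """(SELECT customer, category, SUM(sales_value) AS sales_value
             FROM purchases GROUP BY customer, category)"""

# Variant 1: join the raw rows, so A's two perfume rows each pick up A's total
naive_sql = f"""
SELECT p.product, SUM(pp.sales_value)
FROM purchases p
JOIN {totals} pp ON p.customer = pp.customer AND p.category = pp.category
GROUP BY p.product
"""

# Variant 2: collapse to one row per customer and product before joining
preagg_sql = f"""
SELECT p.product, SUM(pp.sales_value)
FROM (SELECT customer, category, product, SUM(sales_value) AS sales_value
      FROM purchases GROUP BY customer, category, product) p
JOIN {totals} pp ON p.customer = pp.customer AND p.category = pp.category
GROUP BY p.product
"""

naive = dict(conn.execute(naive_sql))
fixed = dict(conn.execute(preagg_sql))
print(naive["perfume"], fixed["perfume"])  # 51 31
```

Customer A's air_care total of 20 is counted once per perfume row in the first variant (20 + 20 + 11 = 51), while the pre-aggregated variant counts it once (20 + 11 = 31).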

I would use a window function to compute the total each customer spent in each category:

SELECT
  customer, product, category, sales_value,
  sum(sales_value) OVER (PARTITION BY customer, category) AS tot_cat
FROM transactions;

 customer | product | category | sales_value | tot_cat 
----------+---------+----------+-------------+---------
 A        | aerosol | air_care |       10.00 |   20.00
 A        | perfume | air_care |        8.00 |   20.00
 A        | perfume | air_care |        2.00 |   20.00
 B        | aerosol | air_care |       12.00 |   12.00
 C        | aerosol | air_care |        7.00 |    7.00
 C        | fries   | food     |        9.00 |   22.00
 C        | burger  | food     |       13.00 |   22.00
 D        | perfume | air_care |       11.00 |   11.00
 D        | fries   | food     |        6.00 |    6.00

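Next we need to sum these up, but a problem arises when a customer buys the same product more than once. In your example, customer A bought perfume twice. To overcome this, let's group by customer, product and category at the same time (and sum the sales_value column):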

SELECT
  customer, product, category, SUM(sales_value) AS sales_value,
  SUM(SUM(sales_value)) OVER (PARTITION BY customer, category) AS tot_cat
FROM transactions
GROUP BY customer, product, category

 customer | product | category | sales_value | tot_cat 
----------+---------+----------+-------------+---------
 A        | aerosol | air_care |       10.00 |   20.00
 A        | perfume | air_care |       10.00 |   20.00 <-- this row summarizes rows 2 and 3 of previous result
 B        | aerosol | air_care |       12.00 |   12.00
 C        | aerosol | air_care |        7.00 |    7.00
 C        | burger  | food     |       13.00 |   22.00
 C        | fries   | food     |        9.00 |   22.00
 D        | perfume | air_care |       11.00 |   11.00
 D        | fries   | food     |        6.00 |    6.00

Now we only have to sum sales_value and tot_cat to get your intermediate result table. I use a common table expression to make the previous result available under the name t:

WITH t AS (
  SELECT
    customer, product, category, SUM(sales_value) AS sales_value,
    SUM(SUM(sales_value)) OVER (PARTITION BY customer, category) AS tot_cat
  FROM transactions
  GROUP BY customer, product, category
)
SELECT
  product, category,
  sum(sales_value) AS sales_value, sum(tot_cat) AS tot_cat,
  sum(sales_value) / sum(tot_cat) AS ratio
FROM t
GROUP BY product, category;

 product | category | sales_value | tot_cat |         ratio          
---------+----------+-------------+---------+------------------------
 aerosol | air_care |       29.00 |   39.00 | 0.74358974358974358974
 fries   | food     |       15.00 |   28.00 | 0.53571428571428571429
 burger  | food     |       13.00 |   22.00 | 0.59090909090909090909
 perfume | air_care |       21.00 |   31.00 | 0.67741935483870967742
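
As a cross-check, the same numbers can also be obtained in plain SQL without window functions, using a correlated subquery with an IN condition on the product's buyers. The sketch below runs it through Python's sqlite3 module, which is an assumption made only so the example is self-contained; it was not run against Postgres, and it assumes each product belongs to a single category.

```python
import sqlite3

rows = [
    ("A", "aerosol", "air_care", 10), ("B", "aerosol", "air_care", 12),
    ("C", "aerosol", "air_care", 7), ("A", "perfume", "air_care", 8),
    ("A", "perfume", "air_care", 2), ("D", "perfume", "air_care", 11),
    ("C", "burger", "food", 13), ("D", "fries", "food", 6),
    ("C", "fries", "food", 9),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions "
    "(customer TEXT, product TEXT, category TEXT, sales_value INTEGER)"
)
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)

query = """
SELECT p.product, p.category,
       SUM(p.sales_value) AS sales_value,
       -- category spend, restricted to customers who bought this product
       (SELECT SUM(c.sales_value)
          FROM transactions c
         WHERE c.category = p.category
           AND c.customer IN (SELECT b.customer
                                FROM transactions b
                               WHERE b.product = p.product)) AS tot_cat
FROM transactions p
GROUP BY p.product, p.category
"""

# Map each product to (sales_value, tot_cat)
result = {prod: (sales, tot) for prod, _, sales, tot in conn.execute(query)}
for prod, (sales, tot) in sorted(result.items()):
    print(prod, sales, tot, round(sales / tot, 4))
```

On the sample data this gives 29 / 39 for aerosol, 13 / 22 for burger, 15 / 28 for fries and 21 / 31 for perfume, matching the table above.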