postgresql: merge rows keeping some information, without loops

Question

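I have a list of calls for every user, sometimes only minutes apart. Users may or may not buy something during these calls. When a user calls again within 45 minutes of their previous call, I need to treat it as part of the same call as the first one.

I need to get, per user, the final number of calls (with calls less than 45 minutes apart merged together) and the number of calls in which something was bought.

For example, I have a list like this: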

buyer       timestamp              bougth_flag        
tom         20150201 9:15                 1
anna        20150201 9:25                 0
tom         20150201 10:15                0
tom         20150201 10:45                1
tom         20150201 10:48                1
anna        20150201 11:50                0
tom         20150201 11:52                0
anna        20150201 11:54                0

The final table would be:

buyer       time_started        calls              articles_bought        
tom         20150201 9:15          1                        1
anna        20150201 9:25          1                        0
tom         20150201 10:15         3                        2
anna        20150201 11:50         2                        0
tom         20150201 11:52         1                        0

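So I need to merge rows that are less than 45 minutes apart while still keeping each user separate. This is easy to do with a loop, but the PostgreSQL I am using has no loops and no functions/procedures. Any ideas on how to do this?

Thank you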

Answer 1

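The biggest problem is that you need to group the results by 45-minute windows, which makes this tricky. The query below is a good starting point, but it is not completely correct: the self-join repeats a row of a once for every later call inside its window, so COUNT(a) is inflated, and grouping by calendar hour only approximates the 45-minute rule. It should still help you get going: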

SELECT a.buyer,
       MIN(a.timestamp),
       COUNT(a),
       COUNT(b),
       SUM(a.bougth_flag),
       SUM(b.bougth_flag)
FROM calls a
LEFT JOIN calls b ON (a.buyer = b.buyer
                      AND a.timestamp != b.timestamp
                      AND a.timestamp < b.timestamp
                      AND a.timestamp + '45 minutes'::INTERVAL > b.timestamp)
GROUP BY a.buyer,
         DATE_TRUNC('hour', a.timestamp) ;

Result:

┌───────┬─────────────────────┬───────┬───────┬─────┬─────┐
│ buyer │         min         │ count │ count │ sum │ sum │
├───────┼─────────────────────┼───────┼───────┼─────┼─────┤
│ tom   │ 2015-02-01 11:52:00 │     1 │     0 │   0 │   Ø │
│ anna  │ 2015-02-01 11:50:00 │     2 │     1 │   0 │   0 │
│ anna  │ 2015-02-01 09:25:00 │     1 │     0 │   0 │   Ø │
│ tom   │ 2015-02-01 09:15:00 │     1 │     0 │   1 │   Ø │
│ tom   │ 2015-02-01 10:15:00 │     4 │     3 │   2 │   3 │
└───────┴─────────────────────┴───────┴───────┴─────┴─────┘

Answer 2

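Since you do not know beforehand how long a "call" is going to last (you could get a call from some buyer every 30 minutes for a whole day; see the comments on the question), you can only solve this with a recursive CTE. (Note that I changed your column 'timestamp' to 'ts'. Never use keywords as table or column names.)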

WITH conversations AS (
  WITH RECURSIVE calls AS (
    SELECT buyer, ts, bought_flag, row_number() OVER (ORDER BY ts) AS conversation, 1::int AS calls
    FROM (
      SELECT buyer, ts, lag(ts) OVER (PARTITION BY buyer ORDER BY ts) AS lag, bought_flag
      FROM list) sub
    WHERE lag IS NULL OR ts - lag > interval '45 minutes'
    UNION ALL
    SELECT l.buyer, l.ts, l.bought_flag, c.conversation, c.calls + 1
    FROM list l
    JOIN calls c ON c.buyer = l.buyer AND l.ts > c.ts
    WHERE l.ts - c.ts < interval '45 minutes'
  )
  SELECT buyer, ts, bought_flag, conversation, max(calls) AS calls
  FROM calls
  GROUP BY buyer, ts, bought_flag, conversation
  order by conversation, ts
)
SELECT buyer, min(ts) AS time_started, max(calls) AS calls, sum(bought_flag) AS articles_bought
FROM conversations
GROUP BY buyer, conversation
ORDER BY time_started

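A few words of explanation:

The starting term of the inner recursive CTE has a subquery that fetches the basic data for every call from the table, together with the time of the same buyer's previous call. The main query of the starting term keeps only the rows that have no previous call (lag IS NULL) or whose previous call is more than 45 minutes earlier. These are the initial calls of what I call a "conversation" here. Each conversation gets an id column, which is simply the row number in that query, and a second column, "calls", that tracks the number of calls in the conversation.
In the recursive term, successive calls in the same conversation are added and the "calls" counter is incremented.
When calls are very close together (such as 10:45 and 10:48 after 10:15), a later call can be reached more than once and so appears several times. Those duplicates (10:48) are collapsed in the outer CTE, where the GROUP BY keeps one row per call with the highest "calls" counter.
Finally, in the main query, the 'bought_flag' column is summed per conversation for each buyer.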
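
The first step described above (finding the call that opens each conversation) can be tried on its own. The following is only a minimal sketch, not the query from this answer: it assumes an in-memory SQLite database (3.25 or later, for window functions) driven from Python's sqlite3 module as a stand-in for PostgreSQL, stores ts as ISO text, and compares the gap in seconds because SQLite has no interval type. The names prev_ts and conversation_starts are illustrative.

```python
import sqlite3

# Sample rows from the question; ts is stored as ISO text so SQLite can parse it.
ROWS = [
    ("tom",  "2015-02-01 09:15", 1),
    ("anna", "2015-02-01 09:25", 0),
    ("tom",  "2015-02-01 10:15", 0),
    ("tom",  "2015-02-01 10:45", 1),
    ("tom",  "2015-02-01 10:48", 1),
    ("anna", "2015-02-01 11:50", 0),
    ("tom",  "2015-02-01 11:52", 0),
    ("anna", "2015-02-01 11:54", 0),
]

# Assumption: 45 minutes is expressed as 2700 seconds.
# In PostgreSQL the same filter would read: ts - lag > interval '45 minutes'.
START_SQL = """
SELECT buyer, ts
FROM (
  SELECT buyer, ts,
         lag(ts) OVER (PARTITION BY buyer ORDER BY ts) AS prev_ts
  FROM list
) sub
WHERE prev_ts IS NULL
   OR CAST(strftime('%s', ts) AS INTEGER)
      - CAST(strftime('%s', prev_ts) AS INTEGER) > 2700
ORDER BY ts
"""

def conversation_starts(rows):
    """Return (buyer, ts) for every call that opens a new conversation."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE list (buyer TEXT, ts TEXT, bought_flag INTEGER)")
    con.executemany("INSERT INTO list VALUES (?, ?, ?)", rows)
    result = con.execute(START_SQL).fetchall()
    con.close()
    return result

starts = conversation_starts(ROWS)
for row in starts:
    print(row)
```

On the sample data this keeps five calls: tom at 09:15, 10:15 and 11:52, and anna at 09:25 and 11:50. The calls at 10:45, 10:48 and 11:54 are dropped because each follows the same buyer's previous call by less than 45 minutes.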

Answer 3

Thanks to Patrick for the heads-up about the original version. You definitely need window functions here, but the CTE is optional.

WITH start_points AS (
  SELECT tmp.*,
         -- distance from each start point to the same buyer's next start point
         (lead(ts) OVER w) - ts AS start_point_lead
  FROM (
    SELECT t.*, ts - (lag(ts) OVER w) AS lag
    FROM test t
    WINDOW w AS (PARTITION BY buyer ORDER BY ts)
  ) tmp
  WHERE lag IS NULL OR lag > interval '45 minutes'
  WINDOW w AS (PARTITION BY buyer ORDER BY ts)
  ORDER BY ts
)
SELECT s.buyer, s.ts, count(*), sum(t.bougth_flag)
FROM start_points s
JOIN test t ON t.buyer = s.buyer AND t.ts >= s.ts
           AND (t.ts - s.ts < s.start_point_lead OR s.start_point_lead IS NULL)
GROUP BY s.buyer, s.ts
ORDER BY s.ts
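
The window-function idea can also be written without joining back to the table, using the pattern usually called "gaps and islands": flag every call that opens a conversation, then use a running sum of those flags per buyer as a conversation id. The sketch below is a variant, not the query above. It assumes an in-memory SQLite database (3.25 or later) driven from Python's sqlite3 module as a stand-in for PostgreSQL, ts stored as ISO text, and the 45-minute gap expressed as 2700 seconds. The names flagged, numbered, is_start, conversation and merge_calls are illustrative.

```python
import sqlite3

# Sample rows from the question, using the table and column names of the query above.
ROWS = [
    ("tom",  "2015-02-01 09:15", 1),
    ("anna", "2015-02-01 09:25", 0),
    ("tom",  "2015-02-01 10:15", 0),
    ("tom",  "2015-02-01 10:45", 1),
    ("tom",  "2015-02-01 10:48", 1),
    ("anna", "2015-02-01 11:50", 0),
    ("tom",  "2015-02-01 11:52", 0),
    ("anna", "2015-02-01 11:54", 0),
]

ISLANDS_SQL = """
WITH flagged AS (
  -- is_start = 1 when the call has no predecessor or follows it by more than 45 min
  SELECT buyer, ts, bougth_flag,
         CASE WHEN prev_ts IS NULL
                OR CAST(strftime('%s', ts) AS INTEGER)
                   - CAST(strftime('%s', prev_ts) AS INTEGER) > 2700
              THEN 1 ELSE 0 END AS is_start
  FROM (
    SELECT buyer, ts, bougth_flag,
           lag(ts) OVER (PARTITION BY buyer ORDER BY ts) AS prev_ts
    FROM test
  ) AS t
),
numbered AS (
  -- running sum of start flags: every call inherits the id of its conversation
  SELECT buyer, ts, bougth_flag,
         SUM(is_start) OVER (PARTITION BY buyer ORDER BY ts) AS conversation
  FROM flagged
)
SELECT buyer, MIN(ts) AS time_started, COUNT(*) AS calls,
       SUM(bougth_flag) AS articles_bought
FROM numbered
GROUP BY buyer, conversation
ORDER BY time_started
"""

def merge_calls(rows):
    """Return (buyer, time_started, calls, articles_bought) per conversation."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE test (buyer TEXT, ts TEXT, bougth_flag INTEGER)")
    con.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)
    result = con.execute(ISLANDS_SQL).fetchall()
    con.close()
    return result

merged = merge_calls(ROWS)
for row in merged:
    print(row)
```

On the sample data this produces five conversations: tom 09:15 (1 call, 1 bought), anna 09:25 (1, 0), tom 10:15 (3, 2), anna 11:50 (2, 0) and tom 11:52 (1, 0). Because it relies only on lag and a windowed SUM, the same shape should carry over to engines without recursive CTEs, although the time arithmetic would have to be rewritten for the target dialect.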

postgresql

amazon-redshift