Select first event after a timestamp per row in another table in PostgreSQL

Question

I have a table recording that a person visited a city at a given timestamp:

city_visits:

person_id         city                timestamp
-----------------------------------------------
        1        Paris      2017-01-01 00:00:00
        1    Amsterdam      2017-01-03 00:00:00
        1     Brussels      2017-01-04 00:00:00
        1       London      2017-01-06 00:00:00
        2       Berlin      2017-01-01 00:00:00
        2     Brussels      2017-01-02 00:00:00
        2       Berlin      2017-01-06 00:00:00
        2      Hamburg      2017-01-07 00:00:00

Another table lists when a person bought an ice cream:

ice_cream_events:

person_id      flavour                timestamp
-----------------------------------------------
        1      Vanilla      2017-01-02 00:12:00
        1    Chocolate      2017-01-05 00:18:00
        2   Strawberry      2017-01-03 00:09:00
        2      Caramel      2017-01-05 00:15:00
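
To reproduce the two tables locally, a setup script along these lines should work. The column types (integer, text, timestamp) are assumptions inferred from the sample rows; the question does not state the real schema.

```sql
-- Assumed schema: types are guessed from the sample data above.
-- "timestamp" is quoted in the DDL only for clarity; it is the same
-- identifier as the unquoted timestamp used in the queries.
CREATE TABLE city_visits (
    person_id   integer   NOT NULL,
    city        text      NOT NULL,
    "timestamp" timestamp NOT NULL
);

CREATE TABLE ice_cream_events (
    person_id   integer   NOT NULL,
    flavour     text      NOT NULL,
    "timestamp" timestamp NOT NULL
);

INSERT INTO city_visits (person_id, city, "timestamp") VALUES
    (1, 'Paris',     '2017-01-01 00:00:00'),
    (1, 'Amsterdam', '2017-01-03 00:00:00'),
    (1, 'Brussels',  '2017-01-04 00:00:00'),
    (1, 'London',    '2017-01-06 00:00:00'),
    (2, 'Berlin',    '2017-01-01 00:00:00'),
    (2, 'Brussels',  '2017-01-02 00:00:00'),
    (2, 'Berlin',    '2017-01-06 00:00:00'),
    (2, 'Hamburg',   '2017-01-07 00:00:00');

INSERT INTO ice_cream_events (person_id, flavour, "timestamp") VALUES
    (1, 'Vanilla',    '2017-01-02 00:12:00'),
    (1, 'Chocolate',  '2017-01-05 00:18:00'),
    (2, 'Strawberry', '2017-01-03 00:09:00'),
    (2, 'Caramel',    '2017-01-05 00:15:00');
```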

For each row in the city_visits table, I need to join the same person's next ice cream event, along with its timestamp and flavour:

desired_output:

person_id       city            timestamp  ic_flavour          ic_timestamp
---------------------------------------------------------------------------
        1      Paris  2017-01-01 00:00:00     Vanilla   2017-01-02 00:12:00
        1  Amsterdam  2017-01-03 00:00:00   Chocolate   2017-01-05 00:18:00
        1   Brussels  2017-01-04 00:00:00   Chocolate   2017-01-05 00:18:00
        1     London  2017-01-06 00:00:00        null                  null
        2     Berlin  2017-01-01 00:00:00  Strawberry   2017-01-03 00:09:00
        2   Brussels  2017-01-02 00:00:00  Strawberry   2017-01-03 00:09:00
        2     Berlin  2017-01-06 00:00:00        null                  null
        2    Hamburg  2017-01-07 00:00:00        null                  null

I have tried the following:

SELECT DISTINCT ON (cv.person_id, cv.timestamp)
  cv.person_id,
  cv.city,
  cv.timestamp,
  ic.flavour as ic_flavour,
  ic.timestamp as ic_timestamp
FROM city_visits cv
JOIN ice_cream_events ic
    ON ic.person_id = cv.person_id
   AND ic.timestamp > cv.timestamp

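The DISTINCT ON clause keeps all but one of the future ice cream events from being joined to each city visit. It works, but it does not automatically select the first one; instead it seems to pick an arbitrary future ice cream event for the same person. No ORDER BY clause I add seems to change this.

The ideal solution would be for the DISTINCT ON clause to pick the row with the minimum ic_timestamp whenever it has to filter out duplicates.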

Answer 1

It seems the DISTINCT ON clause is actually applied after the ORDER BY clause.

So adding the right sort order solved the problem:

SELECT DISTINCT ON (cv.person_id, cv.timestamp)
  cv.person_id,
  cv.city,
  cv.timestamp,
  ic.flavour as ic_flavour,
  ic.timestamp as ic_timestamp
FROM city_visits cv
JOIN ice_cream_events ic
    ON ic.person_id = cv.person_id
   AND ic.timestamp > cv.timestamp
ORDER BY cv.person_id, cv.timestamp ASC, ic.timestamp ASC  -- <- this line added
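
Note that the inner JOIN above drops visits with no later ice cream event (London, and the last two rows for person 2), while the desired output in the question shows them with nulls. A possible variant, not part of the original answer, is to switch to LEFT JOIN and keep the time condition in the ON clause:

```sql
-- Sketch: same DISTINCT ON approach, but LEFT JOIN keeps visits
-- that have no later ice cream event (their ic_* columns are NULL).
SELECT DISTINCT ON (cv.person_id, cv.timestamp)
  cv.person_id,
  cv.city,
  cv.timestamp,
  ic.flavour   AS ic_flavour,
  ic.timestamp AS ic_timestamp
FROM city_visits cv
LEFT JOIN ice_cream_events ic
    ON ic.person_id = cv.person_id
   AND ic.timestamp > cv.timestamp   -- must stay in ON; in WHERE it would discard the NULL rows again
ORDER BY cv.person_id, cv.timestamp, ic.timestamp;
```

The leading ORDER BY expressions have to match the DISTINCT ON expressions; the trailing ic.timestamp is what makes the earliest event win within each visit.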

Answer 2

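Since there is no city in ice_cream_events, your query is going to join lots of ice cream events to every visit before picking the earliest one. I suggest a LEFT JOIN LATERAL instead, which should be much faster in this case if it is supported by fitting indexes: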

SELECT *
FROM   city_visits cv
LEFT   JOIN LATERAL (
   SELECT flavour AS ic_flavour, timestamp AS ic_timestamp
   FROM   ice_cream_events 
   WHERE  person_id = cv.person_id
   AND    timestamp > cv.timestamp
   ORDER  BY timestamp
   LIMIT  1
   ) ice ON true
ORDER  BY cv.person_id, cv.timestamp;

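LEFT [OUTER] JOIN includes visits without any ice cream. If you only want visits that have an ice cream event, switch to CROSS JOIN (and drop the ON true, since CROSS JOIN takes no join condition).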
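
For illustration, the variant that returns only visits followed by an ice cream event could look like this. The explicit column list is my own choice rather than something from the answer:

```sql
-- Sketch: CROSS JOIN LATERAL drops visits for which the subquery
-- returns no row, so no NULL-padded rows appear in the result.
SELECT cv.person_id, cv.city, cv.timestamp,
       ice.ic_flavour, ice.ic_timestamp
FROM   city_visits cv
CROSS  JOIN LATERAL (
   SELECT flavour AS ic_flavour, timestamp AS ic_timestamp
   FROM   ice_cream_events
   WHERE  person_id = cv.person_id
   AND    timestamp > cv.timestamp
   ORDER  BY timestamp
   LIMIT  1
   ) ice
ORDER  BY cv.person_id, cv.timestamp;
```

With the sample data from the question this should leave out London and the last two visits of person 2.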

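In the LATERAL query, the outer ORDER BY only sorts the result rows (unlike when it is combined with DISTINCT ON, where it also decides which row is picked from each set of peers).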

See: Select first row in each GROUP BY group?

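If the tables are big, be sure to have fitting indexes to make this fast. Ideally, a multicolumn index on ice_cream_events (person_id, timestamp, flavour), with the columns in this order, and one on city_visits (person_id, timestamp) for the outer sort. Or maybe even on city_visits (person_id, timestamp, city) to allow another index-only scan. It depends on your actual situation. The example is obviously symbolic.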
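
As a rough sketch, the indexes described above could be created as follows. The index names are made up for this example, and whether the wider covering index pays off depends on the real workload:

```sql
-- Supports the lateral subquery: equality on person_id, range and sort
-- on timestamp, with flavour included so an index-only scan is possible.
CREATE INDEX ice_cream_events_person_ts_flavour_idx   -- hypothetical name
    ON ice_cream_events (person_id, "timestamp", flavour);

-- Supports the outer ORDER BY cv.person_id, cv.timestamp.
CREATE INDEX city_visits_person_ts_idx                -- hypothetical name
    ON city_visits (person_id, "timestamp");

-- Optional alternative to the previous index: also covers city,
-- which may allow an index-only scan on city_visits.
-- CREATE INDEX city_visits_person_ts_city_idx
--     ON city_visits (person_id, "timestamp", city);
```

Running the query under EXPLAIN (ANALYZE) before and after is a reasonable way to check whether the planner actually uses them.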

See: Optimize GROUP BY query to retrieve latest record per user

sql

postgresql

greatest-n-per-group

distinct-on