Select first event after a timestamp per row in another table in PostgreSQL
I have a table of people visiting a city at a given timestamp:
city_visits:
person_id    city         timestamp
-----------------------------------------------
1            Paris        2017-01-01 00:00:00
1            Amsterdam    2017-01-03 00:00:00
1            Brussels     2017-01-04 00:00:00
1            London       2017-01-06 00:00:00
2            Berlin       2017-01-01 00:00:00
2            Brussels     2017-01-02 00:00:00
2            Berlin       2017-01-06 00:00:00
2            Hamburg      2017-01-07 00:00:00
Another table lists when a person bought an ice cream:
ice_cream_events:
person_id    flavour       timestamp
-----------------------------------------------
1            Vanilla       2017-01-02 00:12:00
1            Chocolate     2017-01-05 00:18:00
2            Strawberry    2017-01-03 00:09:00
2            Caramel       2017-01-05 00:15:00
For each row of the city_visits table, I need to join the same person's next ice cream event, along with its timestamp and flavour:
desired_output:
person_id    city         timestamp              ic_flavour    ic_timestamp
---------------------------------------------------------------------------
1            Paris        2017-01-01 00:00:00    Vanilla       2017-01-02 00:12:00
1            Amsterdam    2017-01-03 00:00:00    Chocolate     2017-01-05 00:18:00
1            Brussels     2017-01-04 00:00:00    Chocolate     2017-01-05 00:18:00
1            London       2017-01-06 00:00:00    null          null
2            Berlin       2017-01-01 00:00:00    Strawberry    2017-01-03 00:09:00
2            Brussels     2017-01-02 00:00:00    Strawberry    2017-01-03 00:09:00
2            Berlin       2017-01-06 00:00:00    null          null
2            Hamburg      2017-01-07 00:00:00    null          null
I tried the following:
SELECT DISTINCT ON (cv.person_id, cv.timestamp)
    cv.person_id,
    cv.city,
    cv.timestamp,
    ic.flavour as ic_flavour,
    ic.timestamp as ic_timestamp
FROM city_visits cv
JOIN ice_cream_events ic
    ON ic.person_id = cv.person_id
    AND ic.timestamp > cv.timestamp
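The DISTINCT ON clause prevents each city visit from being joined to all but one of the future ice cream events. It works, but it does not automatically select the first one; instead it seems to pick an arbitrary future ice cream event for the same person. No ORDER BY clause I add seems to change this.
The ideal way to solve this would be to make the DISTINCT ON clause pick the minimum ic_timestamp each time it has to filter out duplicates.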
It appears that the DISTINCT ON clause is actually applied after the ORDER BY clause.
So adding the correct sort order solved the problem:
SELECT DISTINCT ON (cv.person_id, cv.timestamp)
    cv.person_id,
    cv.city,
    cv.timestamp,
    ic.flavour as ic_flavour,
    ic.timestamp as ic_timestamp
FROM city_visits cv
JOIN ice_cream_events ic
    ON ic.person_id = cv.person_id
    AND ic.timestamp > cv.timestamp
ORDER BY cv.person_id, cv.timestamp ASC, ic.timestamp ASC -- <- this line added
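One thing to watch with the query above: a plain JOIN drops visits that have no later ice cream event, while the desired output keeps those rows with null values. A minimal sketch of a LEFT JOIN variant, assuming the tables and columns exactly as shown in the question:

```sql
-- Sketch: same DISTINCT ON approach, but LEFT JOIN keeps visits
-- that have no later ice cream event (ic_* columns come back NULL).
SELECT DISTINCT ON (cv.person_id, cv.timestamp)
       cv.person_id,
       cv.city,
       cv.timestamp,
       ic.flavour   AS ic_flavour,
       ic.timestamp AS ic_timestamp
FROM   city_visits cv
LEFT   JOIN ice_cream_events ic
       ON  ic.person_id = cv.person_id
       AND ic.timestamp > cv.timestamp
-- The leading ORDER BY expressions must match the DISTINCT ON list;
-- the trailing ic.timestamp makes the earliest matching event the row that is kept.
ORDER  BY cv.person_id, cv.timestamp, ic.timestamp;
```

Like the original, this treats (person_id, timestamp) as identifying a visit; if the same person could have two visits with an identical timestamp, they would collapse into one row.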
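Since there is no city in ice_cream_events, your query joins lots of ice cream events to every visit before picking the earliest one. I suggest LEFT JOIN LATERAL instead, which should be much faster in this case if it is supported by a matching index: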
SELECT *
FROM   city_visits cv
LEFT   JOIN LATERAL (
   SELECT flavour AS ic_flavour, timestamp AS ic_timestamp
   FROM   ice_cream_events
   WHERE  person_id = cv.person_id
   AND    timestamp > cv.timestamp
   ORDER  BY timestamp
   LIMIT  1
   ) ice ON true
ORDER  BY cv.person_id, cv.timestamp;
LEFT [OUTER] JOIN includes visits that are not followed by any ice cream event. Switch to CROSS JOIN if you only want visits with ice cream.
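For illustration, the CROSS JOIN form of the LATERAL query might look like the sketch below. It assumes the same tables as in the question; the only changes are the join type and the dropped ON true, since CROSS JOIN takes no join condition:

```sql
-- Sketch: returns only visits that are followed by at least one ice cream event.
SELECT *
FROM   city_visits cv
CROSS  JOIN LATERAL (
   SELECT flavour AS ic_flavour, timestamp AS ic_timestamp
   FROM   ice_cream_events
   WHERE  person_id = cv.person_id      -- unqualified columns refer to ice_cream_events
   AND    timestamp > cv.timestamp
   ORDER  BY timestamp                  -- earliest later event first
   LIMIT  1
   ) ice
ORDER  BY cv.person_id, cv.timestamp;
```

With the sample data, the London visit of person 1 and the last two visits of person 2 would not appear, because the subquery returns no row for them.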
In this case the outer ORDER BY only sorts the result rows (unlike in combination with DISTINCT ON, where it also decides which row to pick from each set of peers).
- Select first row in each GROUP BY group?
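If the tables are big, be sure to have matching indexes to make this fast. Ideally, a multicolumn index on ice_cream_events (person_id, timestamp, flavour), with the columns in this order. And one on city_visits (person_id, timestamp) for the outer sort. Or even on city_visits (person_id, timestamp, city) to allow another index-only scan. It depends on your actual situation; the example is obviously symbolic.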
- Optimize GROUP BY query to retrieve latest record per user
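The index suggestions above could be written as follows. This is only a sketch: the index names are hypothetical, and whether the wider covering index pays off depends on the real tables and workload.

```sql
-- Index names below are made up for illustration.

-- Supports the LATERAL subquery: equality on person_id, range on timestamp,
-- with flavour included so an index-only scan is possible.
CREATE INDEX ice_cream_events_person_ts_flavour_idx
    ON ice_cream_events (person_id, timestamp, flavour);

-- Supports the outer ORDER BY cv.person_id, cv.timestamp.
CREATE INDEX city_visits_person_ts_idx
    ON city_visits (person_id, timestamp);

-- Alternative to the previous index: also covers city,
-- which may allow an index-only scan on city_visits.
-- CREATE INDEX city_visits_person_ts_city_idx
--     ON city_visits (person_id, timestamp, city);
```

Running EXPLAIN (ANALYZE) on the query before and after creating the indexes is a reasonable way to check whether the planner actually uses them.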