How do I group a collection of elements by whether or not they have consecutive values?
Given a table like the one below, I want to fetch the ids that have rows for at least three consecutive years.
+---------+--------+
| id      | year   |
+---------+--------+
| 2       | 2003   |
| 2       | 2004   |
| 1       | 2005   |
| 2       | 2005   |
| 1       | 2007   |
| 1       | 2008   |
+---------+--------+
The result here would of course be:
+---------+
| id      |
+---------+
| 2       |
+---------+
Any input on how to construct a query that does this would be great.
You can use the JOIN approach (a self-join):
SELECT t1.id
FROM tbl t1
JOIN tbl t2 ON t2.year = t1.year + 1
AND t1.id = t2.id
JOIN tbl t3 ON t3.year = t1.year + 2
AND t1.id = t3.id
This approach works, and it is fast when you have at least an index on the id column:
WITH t1 AS (
SELECT *
FROM (VALUES
(2,2003),
(2,2004),
(1,2005),
(2,2005),
(1,2007),
(1,2008)
) v(id, year)
)
SELECT DISTINCT t1.id
FROM t1 -- your tablename
JOIN t1 AS t2 ON t1.id = t2.id AND t1.year + 1 = t2.year
JOIN t1 AS t3 ON t1.id = t3.id AND t1.year + 2 = t3.year;
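As a rough illustration of what the self-join checks, here is a minimal Python sketch of the same logic outside the database: for every row, it looks for rows with the same id in the following two years. The function name `ids_with_three_consecutive` is a hypothetical helper, not something from the answers, and the sample rows are the ones from the question.

```python
def ids_with_three_consecutive(rows):
    """Return sorted ids that have rows for year, year + 1 and year + 2."""
    # A set gives fast membership tests, playing the role of the index.
    present = set(rows)
    return sorted({
        id_
        for id_, year in present
        # Mirrors the two JOIN conditions: same id, year + 1 and year + 2.
        if (id_, year + 1) in present and (id_, year + 2) in present
    })


rows = [(2, 2003), (2, 2004), (1, 2005), (2, 2005), (1, 2007), (1, 2008)]
print(ids_with_three_consecutive(rows))  # prints [2]
```

Like the SQL version, this hard-codes the run length of three; a longer minimum run needs one more lookup (or one more join) per extra year.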
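If the combination (id, year) is UNIQUE
This is typically guaranteed by a PRIMARY KEY or UNIQUE constraint, or by a unique index.
This is a generic solution for any minimum number of consecutive rows: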
SELECT DISTINCT id
FROM (
SELECT id, year - row_number() OVER (PARTITION BY id ORDER BY year) AS grp
FROM tbl
) sub
GROUP BY id, grp
HAVING count(*) > 2; -- minimum: 3
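To see why grouping by `year - row_number()` works, here is a minimal Python sketch of the same idea. Within one id, sorted consecutive years and their row numbers both increase by one per row, so their difference stays constant for the whole run and changes when there is a gap. The function name `ids_with_run` and the `min_len` parameter are illustrative assumptions, and the sketch assumes each (id, year) pair occurs at most once.

```python
from collections import Counter, defaultdict


def ids_with_run(rows, min_len=3):
    """Return sorted ids that have at least min_len consecutive years.

    Assumes each (id, year) pair occurs at most once.
    """
    years_by_id = defaultdict(list)
    for id_, year in rows:
        years_by_id[id_].append(year)

    result = []
    for id_, years in years_by_id.items():
        years.sort()
        # Mirrors year - row_number() OVER (PARTITION BY id ORDER BY year):
        # the difference is the same for every row of one consecutive run.
        run_sizes = Counter(year - rn for rn, year in enumerate(years, start=1))
        # Mirrors GROUP BY id, grp HAVING count(*) >= min_len.
        if any(size >= min_len for size in run_sizes.values()):
            result.append(id_)
    return sorted(result)


rows = [(2, 2003), (2, 2004), (1, 2005), (2, 2005), (1, 2007), (1, 2008)]
print(ids_with_run(rows))  # prints [2]
```

For id 2 the differences are 2002, 2002, 2002 (one run of three); for id 1 they are 2004, 2005, 2005 (runs of one and two), so only id 2 qualifies.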
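This should be faster than repeated self-joins, because it only needs a single scan of the base table. Test performance with EXPLAIN ANALYZE.
Related answer with a detailed explanation: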
If the combination (id, year) is not UNIQUE
You can make it unique in a first step:
SELECT DISTINCT id
FROM (
SELECT id, year - row_number() OVER (PARTITION BY id ORDER BY year) AS grp
FROM tbl
GROUP BY id, year
) sub
GROUP BY id, grp
HAVING count(*) > 2; -- minimum: 3
Alternatively, you could use the window function dense_rank() instead of row_number() and then count(DISTINCT year), but I see no benefit in that approach.
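The effect of duplicates can be sketched in Python as well. If a year appears twice for the same id, the row number advances while the year does not, so a real run can be split across two groups. Dropping duplicates first, as the inner GROUP BY id, year does, avoids that. The function name `ids_with_run_dedup` and the sample data with a duplicated 2004 row are assumptions made for illustration.

```python
from itertools import groupby


def ids_with_run_dedup(rows, min_len=3):
    """Return ids with at least min_len consecutive years, ignoring duplicates."""
    # set() removes repeated (id, year) pairs, like the inner GROUP BY id, year;
    # sorting orders by id first and then by year.
    unique_rows = sorted(set(rows))

    result = []
    for id_, group in groupby(unique_rows, key=lambda row: row[0]):
        years = [year for _, year in group]
        # Same trick as before: year minus row number is constant within a run.
        keys = [year - rn for rn, year in enumerate(years, start=1)]
        longest = max(keys.count(key) for key in set(keys))
        if longest >= min_len:
            result.append(id_)
    return result


# Hypothetical data: the row (2, 2004) is stored twice.
rows = [(2, 2003), (2, 2004), (2, 2004), (2, 2005),
        (1, 2005), (1, 2007), (1, 2008)]
print(ids_with_run_dedup(rows))  # prints [2]
```

Without the deduplication step, the differences for id 2 would be 2002, 2002, 2001, 2001, so the three-year run would be counted as two groups of two and missed.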
Understanding the sequence of events in a SELECT query is key:
- Best way to get result count before LIMIT was applied