Group similar rows and count groups in PostgreSQL
I have a table like this:
number | info | side
--------------------
1 | foo | a
2 | bar | a
3 | bar | a
4 | baz | a
5 | foo | a
6 | bar | b
7 | bar | b
8 | foo | a
9 | bar | a
10 | baz | a
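I want to count how many times a group of consecutive bar rows appears in the info column, per side (for example, rows 2 and 3 form one group, rows 6 and 7 form another, and row 9 is a group by itself). I'm stuck because I don't really know what to google for. Whenever I search for things like group rows or merge rows, I always end up with information about GROUP BY.
I suspect I need some kind of window function, though.
This is what I want to achieve: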
bar_a | bar_b
-------------
2 | 1
Use lag() to determine the first row of each group:
select
number, info, side,
lag(info || side, 1, '') over (order by number) <> info || side as start_of_group
from my_table
order by 1;
number | info | side | start_of_group
--------+------+------+----------------
1 | foo | a | t
2 | bar | a | t
3 | bar | a | f
4 | baz | a | t
5 | foo | a | t
6 | bar | b | t
7 | bar | b | f
8 | foo | a | t
9 | bar | a | t
10 | baz | a | t
(10 rows)
Aggregate and filter the above result to get the desired output:
select concat(info, '_', side) as info_side, count(*)
from (
select
info, side,
lag(info || side, 1, '') over (order by number) <> info || side as start_of_group
from my_table
) s
where info = 'bar' and start_of_group
group by 1
order by 1;
info_side | count
-----------+-------
bar_a | 2
bar_b | 1
(2 rows)
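To make the start-of-group logic concrete, here is a minimal Python sketch (not PostgreSQL). The list `rows` is the sample table from the question, and `count_group_starts` is a hypothetical helper name. It flags a row as the start of a group when its (info, side) pair differs from the previous row's pair, then counts the flagged rows per side, which is roughly what the lag() comparison and the outer filter do.

```python
from collections import Counter

# Sample data copied from the question: (number, info, side)
rows = [
    (1, "foo", "a"), (2, "bar", "a"), (3, "bar", "a"), (4, "baz", "a"),
    (5, "foo", "a"), (6, "bar", "b"), (7, "bar", "b"), (8, "foo", "a"),
    (9, "bar", "a"), (10, "baz", "a"),
]

def count_group_starts(rows, wanted_info):
    """Count runs of consecutive rows sharing (info, side), per side."""
    counts = Counter()
    previous = None  # stands in for lag()'s default value on the first row
    for _number, info, side in sorted(rows):  # sorted() mirrors ORDER BY number
        start_of_group = (info, side) != previous
        if start_of_group and info == wanted_info:
            counts[side] += 1
        previous = (info, side)
    return dict(counts)

print(count_group_starts(rows, "bar"))  # {'a': 2, 'b': 1}
```

The sketch compares the pair itself rather than a concatenated string. Concatenation can collide in general ('ab' || 'c' equals 'a' || 'bc'), although that cannot happen with the sample data.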
If I understand correctly, this is at its core a "gaps-and-islands" problem. For this version, a difference of row numbers should work fine.
select sum( (side = 'a')::int) as num_a,
sum( (side = 'b')::int) as num_b
from (select info, side, count(*) as cnt
from (select t.*,
row_number() over (order by number) as seqnum,
row_number() over (partition by info, side order by number) as seqnum_bs
from t
) t
where info = 'bar'
group by info, side, (seqnum - seqnum_bs)
) si;
You can make do with a single window function, which should be the fastest option:
SELECT side, count(*) AS count
FROM (
SELECT side, grp
FROM (
SELECT side, number - row_number() OVER (PARTITION BY side ORDER BY number) AS grp
FROM tbl
WHERE info = 'bar'
) sub1
GROUP BY 1, 2
) sub2
GROUP BY 1
ORDER BY 1; -- optional
Or shorter, though probably not faster:
SELECT side, count(DISTINCT grp) AS count
FROM (
SELECT side, number - row_number() OVER (PARTITION BY side ORDER BY number) AS grp
FROM tbl
WHERE info = 'bar'
) sub
GROUP BY 1
ORDER BY 1; -- optional
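The "trick" is that adjacent rows forming a group (grp) have consecutive numbers. When you subtract the running count within the partition on side from the running count over all rows (number), all members of a "group" get the same grp number.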
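The arithmetic behind the trick can be traced by hand. The following is a small Python sketch (not PostgreSQL) that mimics number - row_number() OVER (PARTITION BY side ORDER BY number) for the rows where info = 'bar'. The names `rows` and `island_ids` are illustrative only.

```python
from collections import defaultdict

# Sample data copied from the question: (number, info, side)
rows = [
    (1, "foo", "a"), (2, "bar", "a"), (3, "bar", "a"), (4, "baz", "a"),
    (5, "foo", "a"), (6, "bar", "b"), (7, "bar", "b"), (8, "foo", "a"),
    (9, "bar", "a"), (10, "baz", "a"),
]

def island_ids(rows, wanted_info):
    """Return (number, side, grp) with grp = number - per-side row number."""
    row_number = defaultdict(int)  # running count per side (the partition)
    result = []
    for number, info, side in sorted(rows):
        if info != wanted_info:    # WHERE info = 'bar' is applied before the window
            continue
        row_number[side] += 1
        result.append((number, side, number - row_number[side]))
    return result

for row in island_ids(rows, "bar"):
    print(row)
# (2, 'a', 1)
# (3, 'a', 1)
# (6, 'b', 5)
# (7, 'b', 5)
# (9, 'a', 6)
```

Rows 2 and 3 share grp 1, rows 6 and 7 share grp 5, and row 9 gets grp 6 on its own, so counting distinct grp values per side gives 2 for a and 1 for b.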
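If there are gaps in your serial column number (there are none in your demo, but gaps are typical, and presumably you want to ignore them), use row_number() OVER (ORDER BY number) in a subquery in place of number to close the gaps first: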
SELECT side, count(DISTINCT grp) AS count
FROM (
SELECT side, number - row_number() OVER (PARTITION BY side ORDER BY number) AS grp
FROM (SELECT info, side, row_number() OVER (ORDER BY number) AS number FROM tbl) tbl1
WHERE info = 'bar'
) sub2
GROUP BY 1
ORDER BY 1; -- optional
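To illustrate why gaps matter, here is a hedged Python sketch (not PostgreSQL) that applies the same number-minus-row-number arithmetic to a made-up variant of the sample data in which every number is multiplied by 10. The names `gappy_rows` and `count_islands` are hypothetical. Without renumbering, consecutive rows no longer differ by exactly 1, so each row lands in its own grp.

```python
from collections import defaultdict

# Hypothetical data: the question's rows with gaps in "number" (each value * 10)
gappy_rows = [
    (10, "foo", "a"), (20, "bar", "a"), (30, "bar", "a"), (40, "baz", "a"),
    (50, "foo", "a"), (60, "bar", "b"), (70, "bar", "b"), (80, "foo", "a"),
    (90, "bar", "a"), (100, "baz", "a"),
]

def count_islands(rows, wanted_info, renumber):
    """Count distinct grp values per side, optionally closing gaps first."""
    ordered = sorted(rows)
    if renumber:
        # Mirrors row_number() OVER (ORDER BY number) in the inner subquery
        ordered = [(i, info, side)
                   for i, (_n, info, side) in enumerate(ordered, start=1)]
    row_number = defaultdict(int)  # running count per side
    groups = defaultdict(set)      # side -> distinct grp values
    for number, info, side in ordered:
        if info != wanted_info:
            continue
        row_number[side] += 1
        groups[side].add(number - row_number[side])
    return {side: len(g) for side, g in sorted(groups.items())}

print(count_islands(gappy_rows, "bar", renumber=False))  # {'a': 3, 'b': 2}
print(count_islands(gappy_rows, "bar", renumber=True))   # {'a': 2, 'b': 1}
```

With the gaps left in place the count is inflated to 3 and 2. After renumbering, the result matches the expected 2 and 1.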
SQL Fiddle (with an extended test case)