Group similar rows and count groups in PostgreSQL

I have a table like this:

number | info | side
--------------------
     1 |  foo |    a
     2 |  bar |    a
     3 |  bar |    a
     4 |  baz |    a
     5 |  foo |    a
     6 |  bar |    b
     7 |  bar |    b
     8 |  foo |    a
     9 |  bar |    a
    10 |  baz |    a

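I want to count how many times a bar group (block) appears in the info column, per side. For example, rows 2 and 3 form one group, rows 6 and 7 form another, and row 9 is a group of its own. I'm stuck because I don't really know what to google. Whenever I search for things like "group rows" or "merge rows", I always end up with information about GROUP BY.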

I think I need some kind of window function, though.

This is what I want to achieve:

bar_a | bar_b
-------------
    2 |     1

Use lag() to determine the first row of each group:

select 
    number, info, side, 
    lag(info || side, 1, '') over (order by number) <> info || side as start_of_group
from my_table
order by 1;

 number | info | side | start_of_group 
--------+------+------+----------------
      1 | foo  | a    | t
      2 | bar  | a    | t
      3 | bar  | a    | f
      4 | baz  | a    | t
      5 | foo  | a    | t
      6 | bar  | b    | t
      7 | bar  | b    | f
      8 | foo  | a    | t
      9 | bar  | a    | t
     10 | baz  | a    | t
(10 rows)
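
A side note on the comparison above: concatenating info || side can conflate different pairs (for example 'ab' || 'c' and 'a' || 'bc' both give 'abc'), and the result is NULL if either column is NULL. Neither happens with the sample data, but a sketch that compares the two columns separately avoids both. The CTE only stands in for my_table with the rows from the question, and the window name w is my own shorthand, not part of the original answer.

```sql
-- Sample rows copied from the question; the CTE stands in for my_table.
WITH my_table(number, info, side) AS (
    VALUES (1, 'foo', 'a'), (2, 'bar', 'a'), (3, 'bar', 'a'), (4, 'baz', 'a'),
           (5, 'foo', 'a'), (6, 'bar', 'b'), (7, 'bar', 'b'), (8, 'foo', 'a'),
           (9, 'bar', 'a'), (10, 'baz', 'a')
)
SELECT number, info, side,
       -- A row starts a group if either column differs from the previous row.
       -- IS DISTINCT FROM treats the NULL returned by lag() on the first row as "different".
       (lag(info) OVER w IS DISTINCT FROM info
        OR lag(side) OVER w IS DISTINCT FROM side) AS start_of_group
FROM my_table
WINDOW w AS (ORDER BY number)
ORDER BY number;
```

For the sample data this should give the same start_of_group column as the output above.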

Aggregate and filter the above result to get the desired output:

select concat(info, '_', side) as info_side, count(*)
from (
    select 
        info, side, 
        lag(info || side, 1, '') over (order by number) <> info || side as start_of_group
    from my_table
    ) s
where info = 'bar' and start_of_group
group by 1
order by 1;

 info_side | count 
-----------+-------
 bar_a     |     2
 bar_b     |     1
(2 rows)    
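
If you want the single-row layout from the question (bar_a | bar_b) rather than one row per side, one option is to pivot with the aggregate FILTER clause, available since PostgreSQL 9.4. A minimal sketch, with a CTE standing in for my_table and the two side values hard-coded as an assumption:

```sql
-- Sample rows copied from the question; the CTE stands in for my_table.
WITH my_table(number, info, side) AS (
    VALUES (1, 'foo', 'a'), (2, 'bar', 'a'), (3, 'bar', 'a'), (4, 'baz', 'a'),
           (5, 'foo', 'a'), (6, 'bar', 'b'), (7, 'bar', 'b'), (8, 'foo', 'a'),
           (9, 'bar', 'a'), (10, 'baz', 'a')
)
SELECT count(*) FILTER (WHERE side = 'a') AS bar_a,  -- groups of bar on side a
       count(*) FILTER (WHERE side = 'b') AS bar_b   -- groups of bar on side b
FROM (
    SELECT info, side,
           lag(info || side, 1, '') OVER (ORDER BY number) <> info || side AS start_of_group
    FROM my_table
    ) s
WHERE info = 'bar' AND start_of_group;
-- For the sample data: bar_a = 2, bar_b = 1
```

This only works when the set of side values is known in advance; for an open-ended set, the one-row-per-side form above is easier to maintain.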

If I understand correctly, this is a "gaps-and-islands" problem at its core. For this version, a difference of row numbers should work fine.

select sum( (side = 'a')::int) as num_a,
       sum( (side = 'b')::int) as num_b
from (select info, side, count(*) as cnt
      from (select t.*,
                   row_number() over (order by number) as seqnum,
                   row_number() over (partition by info, side order by number) as seqnum_bs
            from t
           ) t
      where info = 'bar'
      group by info, side, (seqnum - seqnum_bs)
     ) si;

You can make do with a single window function, which should be the fastest option:

SELECT side, count(*) AS count
FROM  (
   SELECT side, grp
   FROM  (
      SELECT side, number - row_number() OVER (PARTITION BY side ORDER BY number) AS grp
      FROM   tbl
      WHERE  info = 'bar'
      ) sub1
   GROUP BY 1, 2
   ) sub2
GROUP BY 1
ORDER BY 1;  -- optional
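
To see what the innermost subquery produces, it can help to look at grp for the sample rows. This is only an illustrative sketch: the CTE stands in for tbl, and the extra rn column is added here purely for display.

```sql
-- Sample rows copied from the question; the CTE stands in for tbl.
WITH tbl(number, info, side) AS (
    VALUES (1, 'foo', 'a'), (2, 'bar', 'a'), (3, 'bar', 'a'), (4, 'baz', 'a'),
           (5, 'foo', 'a'), (6, 'bar', 'b'), (7, 'bar', 'b'), (8, 'foo', 'a'),
           (9, 'bar', 'a'), (10, 'baz', 'a')
)
SELECT number, side,
       row_number() OVER (PARTITION BY side ORDER BY number) AS rn,           -- running count per side
       number - row_number() OVER (PARTITION BY side ORDER BY number) AS grp  -- constant within a block
FROM   tbl
WHERE  info = 'bar'
ORDER  BY number;
--  number | side | rn | grp
--       2 | a    |  1 |   1
--       3 | a    |  2 |   1
--       6 | b    |  1 |   5
--       7 | b    |  2 |   5
--       9 | a    |  3 |   6
```

Counting the distinct (side, grp) pairs gives 2 for side a and 1 for side b.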

Or shorter, though probably not faster:

SELECT side, count(DISTINCT grp) AS count
FROM  (
   SELECT side, number - row_number() OVER (PARTITION BY side ORDER BY number) AS grp
   FROM   tbl
   WHERE  info = 'bar'
   ) sub
GROUP BY 1
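ORDER BY 1;  -- optional

The "trick" is that adjacent rows forming a group (grp) have consecutive numbers. When you subtract the running count within the partition (side) from the running count over all rows (number), all members of a "group" get the same grp number.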

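If there are gaps in your serial column number (not the case in your demo, but gaps are common, and presumably you want to ignore them), use row_number() OVER (ORDER BY number) in a subquery instead of number to close the gaps first: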

SELECT side, count(DISTINCT grp) AS count
FROM  (
   SELECT side, number - row_number() OVER (PARTITION BY side ORDER BY number) AS grp
   FROM  (SELECT info, side, row_number() OVER (ORDER BY number) AS number FROM tbl) tbl1
   WHERE  info = 'bar'
   ) sub2
GROUP BY 1
ORDER BY 1;  -- optional
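
To illustrate why closing the gaps matters, here is a small sketch with made-up data (not from the question) in which number 3 is missing. The column names count_raw, count_closed, rn_all and rn_side are my own, chosen for this comparison only.

```sql
-- Hypothetical rows with a gap: number 3 is missing, but rows 2 and 4 are adjacent.
WITH tbl(number, info, side) AS (
    VALUES (1, 'foo', 'a'), (2, 'bar', 'a'), (4, 'bar', 'a'), (5, 'foo', 'a')
)
SELECT side,
       count(DISTINCT number - rn_side) AS count_raw,     -- raw number: the gap splits the block
       count(DISTINCT rn_all - rn_side) AS count_closed   -- gapless row number: one block
FROM  (
   SELECT side, number, rn_all,
          row_number() OVER (PARTITION BY side ORDER BY number) AS rn_side
   FROM  (SELECT info, side, number,
                 row_number() OVER (ORDER BY number) AS rn_all
          FROM   tbl) tbl1
   WHERE  info = 'bar'
   ) sub
GROUP BY side;
-- For this data: side = a, count_raw = 2, count_closed = 1
```

Which of the two counts is "right" depends on whether a missing number should break a group in your data.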

SQL Fiddle (with extended test case)
