Postgres - select non-blank non-null values from multiple ordered rows

Question

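I need to group a large amount of data from multiple sources according to priority, but the sources differ in data quality: some of them may be missing data. The task is to group that data into a separate table as completely as possible.

For example: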

create table grouped_data (
  id serial primary key,
  type text,
  a text,
  b text,
  c int
);

create table raw_data (
  id serial primary key,
  type text,
  a text,
  b text,
  c int,
  priority int
);


insert into raw_data
(type, a,       b,         c,   priority)
values
('one', null,    '',        123, 1),
('one', 'foo',   '',        456, 2),
('one', 'bar',   'baz',     789, 3),
('two', null,    'two-b',   11,  3),
('two', '',      '',        33,  2),
('two', null,    'two-bbb', 22,  1);

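Now I need to group the records by type, order them by priority (a lower number means a higher priority), take the first non-null, non-empty value for each column, and put it into grouped_data. In this case, the value of a for group one would be foo, because the row holding that value has a higher priority than the row with bar. And c should be 123, because it has the highest priority. The same goes for group two: for each column we take the data that is non-null, non-empty, and has the highest priority, falling back to null if there is no actual data.

In the end, grouped_data is expected to contain the following: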

('one', 'foo', 'baz',     123),
('two', null,  'two-bbb', 22)

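I have tried grouping, sub-selects, MERGE, cross joins... but alas, my knowledge of PostgreSQL is not good enough to get it working. One thing I would also like to avoid is going through the columns one by one, because in the real world there are dozens of columns to work with...

A link to the fiddle I have been using to play around with this: http://sqlfiddle.com/#!17/76699/1

Update:

Thank you all! Oleksii Tambovtsev's solution is the fastest. On a data set that closely resembles the real case (2 million records, about 30 fields), it takes only 20 seconds to produce exactly the same data set that previously took over 20 minutes to generate programmatically.

eshirvana's solution does the same in 95 seconds, Steve Kass's in 125 seconds, and Stefanov.sm's in 308 seconds (still far faster than doing it programmatically!).

Thank you all :)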

Answer 1

You can use the window function first_value:

select distinct 
    type 
  , first_value(a) over (partition by type order by nullif(a,'') is null, priority) as a
  , first_value(b) over (partition by type order by nullif(b,'') is null, priority)  as b
  , first_value(c) over (partition by type order by c is null, priority) as c  -- skip null c values as well
from raw_data;
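
To see why the order by in the window works, it may help to look at the sort key on its own. This is only an illustrative sketch against the raw_data sample from the question; the is_blank alias is made up for the example. nullif(a, '') turns empty strings into null, so the is null test is true for both blank and null values, and since false sorts before true in ascending order, rows with real values come first, ordered by priority among themselves.

```sql
-- Assumes the raw_data table and sample rows from the question.
select a,
       priority,
       nullif(a, '') is null as is_blank  -- true for both null and ''
from raw_data
where type = 'one'
order by nullif(a, '') is null, priority;
-- With the sample data, the row with a = 'foo' (priority 2) comes first,
-- then 'bar' (priority 3), and the null row (priority 1) comes last.
```

first_value then simply picks the first row of that ordering within each type.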

Answer 2

You should try this:

SELECT
       type,
       (array_agg(a ORDER BY priority ASC) FILTER (WHERE a IS NOT NULL AND a != ''))[1] as a,
       (array_agg(b ORDER BY priority ASC) FILTER (WHERE b IS NOT NULL AND b != ''))[1] as b,
       (array_agg(c ORDER BY priority ASC) FILTER (WHERE c IS NOT NULL))[1] as c
FROM raw_data GROUP BY type ORDER BY type;
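
Since the question mentions dozens of columns, one possible way to avoid typing each expression by hand is to generate the query text from the catalog. The following is a rough sketch, not a tested production script. It assumes the table lives in the public schema, that id, type and priority are the only columns to leave out, and that only text-like columns need the empty-string check. The output is a single string of SQL that you would review and then run yourself.

```sql
-- Builds the text of an array_agg/FILTER query for every data column of raw_data.
-- Assumptions: schema 'public'; id, type and priority are not data columns.
select format(
         'select type, %s from raw_data group by type order by type;',
         string_agg(
           case
             when data_type in ('text', 'character varying', 'character')
               -- col <> '' is not true for null, so this drops null and '' alike
               then format('(array_agg(%1$I order by priority) filter (where %1$I <> ''''))[1] as %1$I',
                           column_name)
             else format('(array_agg(%1$I order by priority) filter (where %1$I is not null))[1] as %1$I',
                         column_name)
           end,
           ', ' order by ordinal_position
         )
       ) as generated_sql
from information_schema.columns
where table_schema = 'public'
  and table_name = 'raw_data'
  and column_name not in ('id', 'type', 'priority');
```

The %1$I placeholder reuses the first argument and quotes it as an identifier, so unusual column names should still produce valid SQL.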

Answer 3

select distinct on (type) type, 
  first_value(a) over (partition by type order by (nullif(a, '') is null), priority) a, 
  first_value(b) over (partition by type order by (nullif(b, '') is null), priority) b, 
  first_value(c) over (partition by type order by (c is null), priority) c
from raw_data;

Answer 4

This should also work.

WITH types(type) AS (
  SELECT DISTINCT
    type
  FROM raw_data
)
SELECT
  type,
  (SELECT a FROM raw_data WHERE a > '' AND raw_data.type = types.type ORDER BY priority LIMIT 1) AS a,
  (SELECT b FROM raw_data WHERE b > '' AND raw_data.type = types.type ORDER BY priority LIMIT 1) AS b,
  (SELECT c FROM raw_data WHERE c IS NOT NULL AND raw_data.type = types.type ORDER BY priority LIMIT 1) AS c
FROM types
ORDER BY type;
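
Any of the queries in these answers can feed grouped_data directly through insert ... select. Here is a minimal sketch using the array_agg form from Answer 2, assuming the two tables and the sample rows from the question; the other queries could be substituted in the same way.

```sql
-- Assumes the grouped_data and raw_data tables from the question.
insert into grouped_data (type, a, b, c)
select
    type,
    (array_agg(a order by priority) filter (where a is not null and a <> ''))[1],
    (array_agg(b order by priority) filter (where b is not null and b <> ''))[1],
    (array_agg(c order by priority) filter (where c is not null))[1]
from raw_data
group by type;

-- Inspect the result.
select type, a, b, c from grouped_data order by type;
```

With the sample data this should give the two rows listed in the question: ('one', 'foo', 'baz', 123) and ('two', null, 'two-bbb', 22). For group two, no row passes the filter on a, so array_agg returns null and the subscript yields null as well.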

sql

postgresql

grouping