Get distinct information across many fields, some of which are NULL

Question

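I have a table with over 65 million rows and 140 columns. The data comes from several sources and is submitted at least once a month.

I am looking for a quick way to grab specific fields from this data only where they are unique. I want to process all the information to link which invoice was sent with which identification numbers, and who sent it. The problem is that I do not want to iterate over 65 million records. If I can get the distinct values, I will only have to process, say, 5 million records instead of 65 million. See the data description below and the SQL Fiddle for a sample.

Say a client submits an invoice_number linked to passport_number_1, national_identity_number_1 and driving_license_1 every month. I only want one row where this combination appears, i.e. the 4 fields have to be unique.

If they submit the above for 30 months and then in the 31st month send the invoice_number linked to passport_number_1, national_identity_number_2 and driving_license_1, I want to pick this row too, since the national_identity field is new and therefore the whole row is unique.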

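By "linked to" I mean that they appear in the same row.
Any of the fields may be NULL at some point.
The 'pivot/composite' columns are invoice_number and submitted_by. If either of them is missing, drop that row.
I also need to include the database_id in the data above, i.e. the primary_id that is automatically generated by the PostgreSQL database.
The only fields that do not need to be returned are other_column and yet_another_column. Remember the table has 140 columns, so I do not need them.
From the results, create a new table that will hold these unique records.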

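See this SQL fiddle for an attempt to recreate the scenario.

From that fiddle, I would expect a result like this:

Rows 1, 2 and 11: only one of them should be kept, as they are exactly the same. Preferably selected by id.
Rows 4 and 9: one of them will be dropped, as they are exactly the same.
Rows 5, 7 and 8: will be dropped, since they are missing either invoice_number or submitted_by.
The result will then contain rows (1, 2 or 11), 3, (4 or 9), 6 and 10.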

Answer 1

To get one representative row (with the additional fields) from each group that shares the same four distinct fields:

SELECT 
distinct on (
  invoice_number
  , passport_number
  , national_id_number
  , driving_license_number
)
  * -- specify the columns you want here
FROM my_table
where invoice_number is not null
and submitted_by is not null
;

Note that unless you specify an order (documentation on distinct), it is unpredictable exactly which row is returned.
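
One way to make that choice deterministic is sketched below. It assumes that the row with the lowest id should represent each group, and that the columns of interest are the ones named in this thread (the explicit column list is an illustration, not the poster's actual schema). The DISTINCT ON expressions are listed first in ORDER BY, and id is appended as a tie-breaker:

```sql
-- Sketch: keep the lowest-id row of each group.
-- Assumption: the lowest id is the preferred representative.
SELECT DISTINCT ON (
         invoice_number,
         passport_number,
         national_id_number,
         driving_license_number
       )
       id,
       invoice_number,
       submitted_by,
       passport_number,
       national_id_number,
       driving_license_number
FROM my_table
WHERE invoice_number IS NOT NULL
  AND submitted_by IS NOT NULL
ORDER BY invoice_number,
         passport_number,
         national_id_number,
         driving_license_number,
         id;  -- tie-breaker: the first row in each group has the smallest id
```

This only controls which row is kept per group. The output comes back sorted by the four fields, not by id.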

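Edit:

Simply adding order by id at the end to sort this result by id will not work, because PostgreSQL requires the DISTINCT ON expressions to match the leftmost ORDER BY expressions. It can be done either by using a CTE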

with distinct_rows as (
    SELECT 
    distinct on (
      invoice_number
      , passport_number
      , national_id_number
      , driving_license_number
      -- ...
    )
      * -- specify the columns you want here
    FROM my_table
    where invoice_number is not null
    and submitted_by is not null
)
select *
from distinct_rows
order by id;

or by making the original query a subquery

select *
from (
    SELECT 
    distinct on (
      invoice_number
      , passport_number
      , national_id_number
      , driving_license_number
      -- ...
    )
      * -- specify the columns you want here
    FROM my_table
    where invoice_number is not null
    and submitted_by is not null
) t
order by id;
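
The question also asks for the result to be kept in a new table. A minimal sketch using CREATE TABLE ... AS follows. The table name unique_invoice_links is hypothetical, the column list is assumed from the names used in this thread, and the ORDER BY assumes the lowest id should be kept for each group:

```sql
-- Sketch: materialise one row per distinct combination into a new table.
-- unique_invoice_links is a hypothetical name, not from the original post.
CREATE TABLE unique_invoice_links AS
SELECT DISTINCT ON (
         invoice_number,
         passport_number,
         national_id_number,
         driving_license_number
       )
       id,
       invoice_number,
       submitted_by,
       passport_number,
       national_id_number,
       driving_license_number
FROM my_table
WHERE invoice_number IS NOT NULL
  AND submitted_by IS NOT NULL
ORDER BY invoice_number,
         passport_number,
         national_id_number,
         driving_license_number,
         id;

-- CREATE TABLE AS copies no constraints or indexes from my_table.
-- Assumption: id is unique in my_table, so it can serve as the key here.
ALTER TABLE unique_invoice_links ADD PRIMARY KEY (id);
```

Listing the columns explicitly, rather than using *, keeps the other columns of the 140-column source table out of the new table.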

Answer 2

quick way to grab specific fields from this data only where they are unique

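I don't think so. I think you mean that you want to select a distinct set of rows from a table in which they are not unique.

As far as I can tell from your description, you simply want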

SELECT distinct invoice_number, passport_number, 
                driving_license_number, national_id_number
FROM my_table
where invoice_number is not null
and submitted_by is not null;

On your SQLFiddle example, that produces 5 rows.
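
SELECT DISTINCT returns only the four fields, so the id requested in the question is not part of the result. One possible variation, which is not part of the original answer, is to group by the same four columns and take the smallest id of each group. GROUP BY, like DISTINCT, places NULL values in the same group, so rows that differ only by having NULL in the same fields still collapse into one:

```sql
-- Sketch: same distinct combinations as above, plus the smallest id per group.
-- Assumption: the smallest id is an acceptable representative for each group.
SELECT min(id) AS id,
       invoice_number,
       passport_number,
       driving_license_number,
       national_id_number
FROM my_table
WHERE invoice_number IS NOT NULL
  AND submitted_by IS NOT NULL
GROUP BY invoice_number,
         passport_number,
         driving_license_number,
         national_id_number
ORDER BY min(id);
```

Other columns, such as submitted_by, cannot be selected here without an aggregate, because they may differ between rows of the same group.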

sql

postgresql

distinct-on