Enforce 2 unique constraints on upserts in Postgres

Question

I have a table that stores contact data:

                                                Table "public.person"
            Column             |           Type           | Collation | Nullable |              Default               
-------------------------------+--------------------------+-----------+----------+------------------------------------
 id                            | integer                  |           | not null | nextval('person_id_seq'::regclass)
 full_name                     | character varying        |           |          | 
 role                          | character varying        |           |          | 
 first_name                    | character varying        |           |          | 
 last_name                     | character varying        |           |          | 
 linkedin_slug                 | character varying        |           |          | 
 email                         | character varying        |           |          | 
 domain                        | character varying        |           |          | 
 created_at                    | timestamp with time zone |           |          | now()
 updated_at                    | timestamp with time zone |           |          | now()
Indexes:
    "pk_person" PRIMARY KEY, btree (id)
    "ix_person_domain" btree (domain)
    "ix_person_email" btree (email)
    "ix_person_updated_at" btree (updated_at)
    "uq_person_full_name_domain" UNIQUE CONSTRAINT, btree (full_name, domain)

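I add data to this table from several sources. Some sources have LinkedIn profile data about a person, others have email data. Sometimes the full names differ even though they refer to the same person.

I want to do upserts without duplicating data. Right now I use the constraint on (full_name, domain). I know this is an oversimplification, since the same company could have two people with the same full name, but that is not a problem for now.

The problem comes when a person has different full names in the different data sources I use but the same LinkedIn profile, so I know it is the same person.

Or when they are associated with 2 domains of the same company.

In those cases I end up with duplicate rows for some people. For example: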

full_name	domain	linkedin_slug
Raffi SARKISSIAN	getlago.com	sarkissianraffi
Raffi Sarkissian	getlago.com	sarkissianraffi

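This case is trivial and could be solved by enforcing uniqueness on (lower(full_name), domain), but there are cases where the last name is not the same (in many countries people have more than one last name and may not always use all of them).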
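For reference, the case-insensitive variant mentioned above could look roughly like the sketch below. The index name is made up for illustration, and it assumes no existing rows already collide once full_name is lower-cased.

```sql
-- Hypothetical index name. Creating it fails if two existing rows already
-- share (lower(full_name), domain), so those rows must be merged first.
CREATE UNIQUE INDEX uq_person_lower_full_name_domain
    ON person (lower(full_name), domain);

-- The conflict target repeats the indexed expression so that Postgres
-- can infer the expression index as the arbiter.
INSERT INTO person (full_name, domain, linkedin_slug)
VALUES ('Raffi SARKISSIAN', 'getlago.com', 'sarkissianraffi')
ON CONFLICT (lower(full_name), domain) DO UPDATE
SET linkedin_slug = COALESCE(EXCLUDED.linkedin_slug, person.linkedin_slug),
    updated_at    = now();
```

This only covers names that differ in letter case; it does nothing for differing last names or for a second domain.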

Another example:

full_name	domain	linkedin_slug
Amir Manji	tenjin.com	amirmanji
Amir Manji	tenjin.io	amirmanji

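Ideally I would like to be able to enforce more than 1 constraint at the same time in Postgres, but from what I have seen that is not possible. I don't want to (and can't) create a unique constraint on (full_name, domain, linkedin_slug). And the solution from the accepted answer is not great for my use case, because I have many more columns than that example and I would have to write a different upsert function for each data source (not all data sources have the same attributes).

What I am thinking of is writing a script that removes the duplicated information 'manually' after inserting new data, but I am not sure whether there is a better way to solve this.

How would you do it?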

Answer 1

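We can't create a unique constraint on a function of a column, but we can create a unique constraint on a generated column that is a function of a column, for example LOWER(). For the domain I have created a generated column containing the part before the first dot, which is then subject to a unique constraint.
Note: generated columns are supported in Postgres 12 or later. In this way we detect that:

'Joe BLOGGS' is Joe Bloggs
hello.UK duplicates hello.com
'invalid' is not an e-mail address.
Checking whether an e-mail address is valid is complex. A simple check that there is an @ followed by a dot rules out phone numbers and the like.
You will have to decide which constraints are enforceable and which might stop you from entering data that should be acceptable.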

create table person (
 id                             serial     not null       ,
 full_name                        varchar(25)                 ,
 role                             varchar(25)                 ,
 first_name                       varchar(25)                 ,
 last_name                        varchar(25)                 ,
 linkedin_slug                    varchar(25)                 ,
 email                            varchar(25)                 ,
 domain                           varchar(25)                 ,
 created_at                        timestamp  default  now()  ,
 updated_at                        timestamp  default  now()  ,
 domainRoot varchar(25) GENERATED ALWAYS AS ( LEFT(domain, STRPOS(domain,'.')-1)) STORED,
 l_fname varchar(25) GENERATED ALWAYS AS ( LOWER(full_name)) STORED,
 CONSTRAINT "pk_person" PRIMARY KEY (id),
 CONSTRAINT  "uq_person_full_name_domain" UNIQUE (full_name, domain),
 CONSTRAINT  "uq_lower_full_name" UNIQUE (l_fname),
 CONSTRAINT "uq_email" UNIQUE (email),
 CONSTRAINT "ck_valid_email" CHECK (email LIKE '%@%.%'),
 CONSTRAINT "uq_domain_root" UNIQUE (domainRoot)
);

insert into person (full_name) values ('Joe Bloggs'),('Mrs Brown');

insert into person (full_name) values ('Joe BLOGGS');

ERROR:  duplicate key value violates unique constraint "uq_lower_full_name"

DETAIL:  Key (l_fname)=(joe bloggs) already exists.

update person set email = 'invalid', updated_at = now();

ERROR:  new row for relation "person" violates check constraint "ck_valid_email"

DETAIL:  Failing row contains (1, Joe Bloggs, null, null, null, null, invalid, null, 2022-04-08 16:22:03.749316, 2022-04-08 16:22:03.751928, null, joe bloggs).

update person set email = 'ok@dom.com' where id = 2;

update person set domain = 'hello.com' , updated_at = now() where id = 1;

update person set domain = 'hello.UK' where id = 2;

ERROR:  duplicate key value violates unique constraint "uq_domain_root"

DETAIL:  Key (domainroot)=(hello) already exists.

SELECT * FROM person;

id | full_name  | role | first_name | last_name | linkedin_slug | email      | domain    | created_at                 | updated_at                 | domainroot | l_fname   
-: | :--------- | :--- | :--------- | :-------- | :------------ | :--------- | :-------- | :------------------------- | :------------------------- | :--------- | :---------
 2 | Mrs Brown  | null | null       | null      | null          | ok@dom.com | null      | 2022-04-08 16:22:03.749316 | 2022-04-08 16:22:03.749316 | null       | mrs brown 
 1 | Joe Bloggs | null | null       | null      | null          | null       | hello.com | 2022-04-08 16:22:03.749316 | 2022-04-08 16:22:03.753807 | hello      | joe bloggs

db<>fiddle here
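
To tie this back to upserts, here is a rough sketch of how inserts could be written against the table above once it carries several unique constraints. The inserted values are made up for illustration. A DO UPDATE clause has to name a single arbiter constraint, whereas DO NOTHING without a conflict target skips the row on a violation of any unique constraint.

```sql
-- Upsert keyed on one named constraint from the table above:
-- 'Joe BLOGGS' collides with 'Joe Bloggs' through l_fname, so the
-- existing row is updated instead of a new one being inserted.
INSERT INTO person (full_name, email)
VALUES ('Joe BLOGGS', 'joe@hello.com')
ON CONFLICT ON CONSTRAINT uq_lower_full_name DO UPDATE
SET email      = COALESCE(EXCLUDED.email, person.email),
    updated_at = now();

-- No conflict target: the row is silently skipped if it violates
-- any unique constraint (here uq_email, since ok@dom.com is taken).
INSERT INTO person (full_name, email)
VALUES ('Someone Else', 'ok@dom.com')
ON CONFLICT DO NOTHING;
```

The catch-all form cannot update the existing row, so it only helps when dropping the incoming duplicate is acceptable.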

Answer 2

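Update: I ended up doing this by first running the upsert that enforces (full_name, domain), and then running a deduplication pass on linkedin_slug: a script that basically groups by linkedin_slug and takes any value that is not null: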

    SELECT
        max(id) id
        , max(full_name) full_name
        , max("role") "role"
        , max(first_name) first_name
        , max(last_name) last_name
        , linkedin_slug
        , max(linkedin_id) linkedin_id
        , max(email) email
        , max("domain") "domain"
        , max(yc_bio) yc_bio
        , min(created_at) created_at
        , now() updated_at
        , max(extrapolated_email_confidence) extrapolated_email_confidence
        , max(email_status) email_status
        , max(email_searched_on_apollo::text)::bool email_searched_on_apollo
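    FROM person
    -- skip rows without a slug: GROUP BY puts all NULLs into one group
    WHERE linkedin_slug IS NOT NULL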
    GROUP BY linkedin_slug
    HAVING count(*) > 1

The original table is then updated with the data from this subquery.
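
A minimal sketch of what that update step might look like, not the actual script from the gist. It uses only the columns shown in the question's table, the temp table name is hypothetical, and it assumes no other table references person.id. It filters out rows with a NULL linkedin_slug, since GROUP BY would otherwise put them all in one group.

```sql
BEGIN;

-- One merged row per duplicated slug, kept under the highest id.
CREATE TEMP TABLE merged_person ON COMMIT DROP AS
SELECT max(id)         AS id,
       max(full_name)  AS full_name,
       max("role")     AS "role",
       max(first_name) AS first_name,
       max(last_name)  AS last_name,
       linkedin_slug,
       max(email)      AS email,
       max("domain")   AS "domain",
       min(created_at) AS created_at
FROM person
WHERE linkedin_slug IS NOT NULL
GROUP BY linkedin_slug
HAVING count(*) > 1;

-- Delete the losing rows first, so the survivor's new (full_name, domain)
-- cannot collide with one of its own duplicates.
DELETE FROM person p
USING merged_person m
WHERE p.linkedin_slug = m.linkedin_slug
  AND p.id <> m.id;

-- Copy the merged values onto the surviving row.
UPDATE person p
SET full_name  = m.full_name,
    "role"     = m."role",
    first_name = m.first_name,
    last_name  = m.last_name,
    email      = m.email,
    "domain"   = m."domain",
    created_at = m.created_at,
    updated_at = now()
FROM merged_person m
WHERE p.id = m.id;

COMMIT;
```

The update can still fail on uq_person_full_name_domain if the merged (full_name, domain) pair already belongs to an unrelated row, in which case the transaction rolls back and those rows need a manual look.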

The full gist is here.

