Filter emails and names, then de-duplicate across two columns using JSON on PostgreSQL 12

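I have an emails table with sender and reporter columns. I want to search these columns for a given parameter and return the unique values.

Let me explain with an example. Here are my table and records: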

CREATE TABLE public.emails (
  id                bigint NOT NULL PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
    (MAXVALUE 9223372036854775807),
  sender            jsonb NOT NULL,
  reporter          jsonb NOT NULL
);

insert into emails (sender, reporter) VALUES ('[{"email": "dennis1@example.com", "name": "dennis1"}]', '[]');
insert into emails (sender, reporter) VALUES ('[{"email": "dennis2@example.com", "name": "dennis1"}]', '[{"email": "john@example.com", "name": "john"}, {"email": "dennis1@example.com", "name": "dennis1"}, {"email": "dennis2@example.com", "name": "dennis2"}]');
insert into emails (sender, reporter) VALUES ('[{"email": "dennis1@example.com", "name": "dennis1"}]', '[]');
insert into emails (sender, reporter) VALUES ('[{"email": "dennis1@example.com", "name": "dennis1"}]', '[]');

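I want to get the email addresses and names. I also want to avoid duplicates: each email should appear only once, with one name. I also don't want the result as an array, but as one email and name per row.

Searching for john: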

SELECT
* /* I don't know what to put here, or how to merge with reporter */
FROM "emails" AS "e"
WHERE (EXISTS (SELECT
    *
  FROM JSONB_ARRAY_ELEMENTS_TEXT("e"."sender") AS "e" ("email")
  WHERE ("e"."email" ~* 'john' or "e"."name" ~* 'john'))
);

Expected result for john:

email                 name
john@example.com      john

Searching for `` (empty):

SELECT
* /* I don't know what to put here, or how to merge with reporter */
FROM "emails" AS "e"
WHERE (EXISTS (SELECT
    *
  FROM JSONB_ARRAY_ELEMENTS_TEXT("e"."sender") AS "e" ("email")
  WHERE ("e"."email" ~* '' or "e"."name" ~* ''))
);

Expected result for `` (empty):

email                 name
john@example.com      john
dennis1@example.com   dennis1
dennis2@example.com   dennis2

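dennis2 appears in both sender and reporter, so only one of the two is needed. No duplicates.

Actually, there is another problem here. If the sender or reporter column contains at least one JSON object (rather than a JSON array), this query also fails:

ERROR: cannot extract elements from an object

That is a separate issue, though.

How can I achieve my goal in this case?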

Demo: https://dbfiddle.uk/?rdbms=postgres_12&fiddle=1bf9c5f83f5104e2392c31984cb4e939

Normalize your data before searching, then use a distinct on () clause to remove the duplicates:

with cte as (select x ->> 'name' as name, x ->> 'email' as email
from emails as e, jsonb_array_elements(e.sender || e.reporter) as x)
select distinct on (email) * from cte where 
  name ~* '' or email ~* ''
  --name ~* 'john' or email ~* 'john'
order by email;
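
One detail the query above leaves open: distinct on (email) keeps the first row of each email group, and with order by email alone the choice among rows that share an email is not specified. In the sample data, dennis2@example.com appears with the name dennis1 in sender and dennis2 in reporter. Below is a minimal sketch, assuming you prefer the reporter entry over the sender entry. The src column and that preference are illustrative assumptions, not part of the original query.

```sql
with cte as (
  -- Tag each element with its origin: 1 = sender, 0 = reporter (hypothetical priority).
  select x ->> 'name' as name, x ->> 'email' as email, 1 as src
  from emails as e, jsonb_array_elements(e.sender) as x
  union all
  select x ->> 'name', x ->> 'email', 0
  from emails as e, jsonb_array_elements(e.reporter) as x
)
select distinct on (email) email, name
from cte
where name ~* '' or email ~* ''
order by email, src, name;  -- reporter rows first, then name as a final tie-break
```

With the four sample rows this should pick dennis2 as the name for dennis2@example.com, which matches the expected result in the question. Swap the src values if the sender entry should win instead.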

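
The question also mentions rows where sender or reporter holds a single JSON object instead of an array. When both columns of a row are objects, || merges them into one object and jsonb_array_elements raises "cannot extract elements from an object". A possible workaround, sketched under the assumption that such objects have the same name/email shape as the array elements, is to wrap every non-array value with jsonb_build_array before concatenating:

```sql
with cte as (
  select x ->> 'name' as name, x ->> 'email' as email
  from emails as e,
       jsonb_array_elements(
         -- Wrap anything that is not already an array into a one-element array.
         case when jsonb_typeof(e.sender) = 'array'
              then e.sender else jsonb_build_array(e.sender) end
         ||
         case when jsonb_typeof(e.reporter) = 'array'
              then e.reporter else jsonb_build_array(e.reporter) end
       ) as x
)
select distinct on (email) email, name
from cte
where name ~* 'john' or email ~* 'john'
order by email;
```

Values that are neither arrays nor objects (for example a bare string) simply yield null for name and email, so the where clause filters them out.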

Note that this will always scan the whole table; no index is applicable in this case. Consider normalizing the schema.
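
As a rough illustration of what schema normalization could look like here, the sketch below moves each name/email pair into its own row. The table name email_participants, its columns, and the index names are hypothetical, and the pg_trgm extension is assumed to be installable on your server. pg_trgm GIN indexes support ~* matches, so the planner may be able to avoid a full scan for non-empty patterns, although that depends on the pattern and the data.

```sql
-- Hypothetical normalized table: one row per address found in an email.
create table email_participants (
  email_id bigint not null references emails (id),
  role     text   not null check (role in ('sender', 'reporter')),
  email    text   not null,
  name     text
);

-- One-off backfill from the existing jsonb columns (assumes both hold arrays).
insert into email_participants (email_id, role, email, name)
select e.id, 'sender', x ->> 'email', x ->> 'name'
from emails as e, jsonb_array_elements(e.sender) as x
where x ->> 'email' is not null
union all
select e.id, 'reporter', x ->> 'email', x ->> 'name'
from emails as e, jsonb_array_elements(e.reporter) as x
where x ->> 'email' is not null;

-- Trigram indexes so that regex searches have something to use.
create extension if not exists pg_trgm;
create index email_participants_email_trgm
  on email_participants using gin (email gin_trgm_ops);
create index email_participants_name_trgm
  on email_participants using gin (name gin_trgm_ops);

-- The search then becomes a plain query over text columns.
select distinct on (email) email, name
from email_participants
where name ~* 'john' or email ~* 'john'
order by email;
```

An empty pattern still matches every row, so that case remains a full scan regardless of indexing.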