Assigning customers a household_id based on matched customer fields (phone, email, address) - trouble with duplicates
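I have a customer table in which each customer has a unique id, determined by the email they used when placing an order. The table also has separate phone, email, and address columns. If a customer's phone, email, or address matches that of another "customer" in the customer table, I am trying to group those customer ids under the same household_id.
The problem I'm running into is that I can group customers and give them a household_id, but I'm struggling to fully remove the duplicate occurrences of these customer groupings in my filtering. The last two comments in my query below are meant to explain my current filtering logic and show where it fails. The logic works for pairs of customers, but it starts to fail as soon as 3 or more customers need to be tied to the same household_id. Is there a better way to filter these results, or do I need to add some additional CTEs that use min()/max() and some other kind of grouping to add more intelligence here? Are there any clever window functions besides rank() that could help me?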
with household as (
    select
        c1.id as parent_id,
        c2.id as child_id,
        rank() over (partition by c1.id order by c2.id) as child_number
        -- order by clause is important here to ensure lowest c2.id is always rank 1 (referenced later on in household join onto customer table)
    from customer c1
    left join customer c2 on (c1.phone = c2.phone) or (c1.email = c2.email) or (c1.address = c2.address)
    order by c1.id, child_number
)
select
    'H-' || h.parent_id as household_id, -- effectively creates a unique household_id
    h.child_id
from household h
where h.parent_id < h.child_id or (h.parent_id = h.child_id and h.child_number = 1)
-- ^this where clause is my attempt at removing the duplicate groupings of customers
-- it works in the instance when there is a pair of customers tied to a household_id, but when there are 3 or more it starts to fail
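See the linked image for the output of the household CTE for a group of 3 customer ids that are joined together because they have a matching phone, email, or address. The highlighted rows are the ones that pass the filter in the where clause of the query above.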
How my query is failing
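To make the failure concrete, here is a minimal sketch that replays the where-clause filter in plain Python on made-up data. The three sample customers and the helper names (matches, pairwise_filter) are assumptions for illustration only; the filter condition itself is copied from the query above. Because ids are unique, rank() within a parent is the same as the position in the sorted child list.

```python
# Hypothetical sample data: (id, phone, email, address).
# All three customers share one phone number, so they belong together.
customers = [
    (1, "555-0100", "a@example.com", "1 Main St"),
    (2, "555-0100", "b@example.com", "2 Oak Ave"),
    (3, "555-0100", "c@example.com", "3 Elm Rd"),
]

def matches(a, b):
    """Same join condition as the query: phone, email, or address is equal."""
    return a[1] == b[1] or a[2] == b[2] or a[3] == b[3]

def pairwise_filter(rows):
    """Replay the self-join, the rank() per parent, and the where clause."""
    kept = []
    for parent in rows:
        # Children of this parent, ordered by id (rank 1 is the lowest id).
        children = sorted(c[0] for c in rows if matches(parent, c))
        for child_number, child_id in enumerate(children, start=1):
            parent_id = parent[0]
            if parent_id < child_id or (parent_id == child_id and child_number == 1):
                kept.append((f"H-{parent_id}", child_id))
    return kept

for household_id, child_id in pairwise_filter(customers):
    print(household_id, child_id)
```

With this data the filter keeps H-1 for customers 1, 2, and 3, but it also keeps the row (H-2, 3), so customer 3 lands in two households. The condition parent_id < child_id only compares a pair of ids; it cannot tell that parent 2 already belongs to the household started by customer 1.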
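I solved this using an approach similar to the one Julius discussed in the comments:
- Use a recursive CTE, grp, to gather all similar customers into arrays
- Use select distinct on and the cardinality() function in the household CTE to reduce the recursive grp output to only the arrays that contain all household members
- Scan the household table and remove any duplicates by unnesting the linked_customers array and tying all of its elements to a single household_id based on the minimum id in the array
This seems to work well, but I'm sure this query can be simplified further, and I'd welcome any feedback!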
with recursive grp as (
    select
        c1.id,
        c1.email,
        c1.phone,
        c1.address,
        array[c1.id] as linked_customers -- initializes an array from the id of every customer
    from customer c1
    union all
    select
        c2.id,
        c2.email,
        c2.phone,
        c2.address,
        c2.id || linked_customers
    from grp g
    join customer c2 on (g.email = c2.email or g.phone = c2.phone or g.address = c2.address)
    where c2.id <> all(linked_customers) -- skips any customer id that is already in the array, so it is not visited again
), -- creates several similar groups as well as intermediary groups of customers
household as (
    select
        distinct on (g.id) g.linked_customers
    from grp g
    order by g.id, cardinality(linked_customers) desc
) -- extracts largest array for each customer_id (still duplicate groupings here), but only the max length arrays are being pulled out (all household members)
select
    distinct on (p.parent_id) 'H-' || parent_id as household_id,
    unnest(p.linked_customers) as child_id
from (
    select
        min(parent_id) as parent_id, -- pulls out the minimum id of each linked customers group, which avoids creating multiple household_ids for the same customer group in the select clause above
        h.linked_customers
    from household h, unnest(h.linked_customers) parent_id
    group by h.linked_customers
) p
order by parent_id
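On the question of simplifying this further, one possible direction (a sketch, not a drop-in replacement for the query above) is to skip the arrays entirely: let the recursion collect every id each customer can reach through shared fields, using union rather than union all so repeated rows are dropped and the recursion terminates, and then take the minimum reachable id as the household label. The SQL below sticks to syntax that I believe both PostgreSQL and SQLite accept; it is wrapped in Python's sqlite3 module only so it can be run without a server. The CTE names edges and reach, the function assign_households, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# Each customer is labelled with the smallest id reachable through
# shared phone/email/address links (its connected component).
HOUSEHOLD_SQL = """
with recursive
edges as (   -- every pair of customers sharing at least one field
    select c1.id as a, c2.id as b
    from customer c1
    join customer c2
      on c1.phone = c2.phone or c1.email = c2.email or c1.address = c2.address
),
reach(id, member) as (
    select id, id from customer      -- each customer reaches itself
    union                            -- union (not union all) drops repeats
    select r.id, e.b
    from reach r
    join edges e on e.a = r.member   -- follow one more link
)
select 'H-' || min(member) as household_id, id as child_id
from reach
group by id
order by min(member), id
"""

def assign_households(rows):
    """Load (id, phone, email, address) rows and return (household_id, child_id)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table customer (id integer primary key, phone text, email text, address text)"
    )
    conn.executemany("insert into customer values (?, ?, ?, ?)", rows)
    result = conn.execute(HOUSEHOLD_SQL).fetchall()
    conn.close()
    return result

# Hypothetical data: 1-2 share a phone, 2-3 share an email, 4-5 share an address.
SAMPLE = [
    (1, "555-0100", "a@example.com", "1 Main St"),
    (2, "555-0100", "b@example.com", "2 Oak Ave"),
    (3, "555-0199", "b@example.com", "3 Elm Rd"),
    (4, "555-0142", "d@example.com", "9 Pine Ct"),
    (5, "555-0177", "e@example.com", "9 Pine Ct"),
]

for household_id, child_id in assign_households(SAMPLE):
    print(household_id, child_id)
```

Customers 1 and 3 share no field directly, but both end up in H-1 because they are linked through customer 2. Two caveats: null fields never match each other under =, so customers with missing data only link through their populated fields, and the reach set grows with the square of the household size, so very large linked groups may need a different approach.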