Assigning customers a household_id based off matched customer fields (phone, email, address) - trouble with duplicates

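I have a customer table in which each customer has a unique id that depends on the email they used when placing an order. The table also has separate columns for phone, email, and address. I am trying to group customer ids under the same household_id whenever their phone, email, or address matches that of another "customer" in the customer table.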

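The problem I am running into is that I can group customers and give them a household_id, but I am struggling to fully remove the duplicate occurrences of these customer groupings in my filtering. The last two comments in my query below explain my current filtering logic and show where it fails. The logic works for pairs of customers, but it starts to fail once 3 or more customers need to be tied to the same household_id. Is there a better way to filter these results, or do I need additional CTEs that use min()/max() and some other kind of grouping to add more intelligence here? Are there any clever window functions other than rank() that could help?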

with household as (
  select
    c1.id as parent_id,
    c2.id as child_id,
    rank() over (partition by c1.id order by c2.id) as child_number
    -- order by clause is important here to ensure lowest c2.id is always rank 1 (referenced later on in household join onto customer table)
                    
  from customer c1
    left join customer c2 on (c1.phone = c2.phone) or (c1.email = c2.email) or (c1.address = c2.address)
                            
  order by c1.id, child_number
)
                
select
  'H-' || h.parent_id as household_id, -- effectively creates a unique household_id
  h.child_id
                    
from household h
  where h.parent_id < h.child_id or (h.parent_id = h.child_id and h.child_number = 1)
  -- ^this where clause is my attempt at removing the duplicate groupings of customers
  -- it works in the instance when there is a pair of customers tied to a household_id, but when there are 3 or more it starts to fail

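See the linked image for the output of the household cte for a group of 3 customer ids that are joined together because they have a matching phone, email, or address. The highlighted rows are the ones that pass the filter in the where clause of the query above.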
How my query is failing
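The failure can be reproduced on a tiny made-up data set. The sketch below is an illustration only: it runs the query above through Python's sqlite3 module (assuming an SQLite build with window-function support, 3.25 or later) against three hypothetical customers who all share one phone number. The sample rows and the final order by are additions for the demo and are not part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customer (id integer primary key, phone text, email text, address text);
-- hypothetical rows: all three customers share the same phone number
insert into customer values
  (1, '555-0001', 'a@example.com', '1 Elm St'),
  (2, '555-0001', 'b@example.com', '2 Oak St'),
  (3, '555-0001', 'c@example.com', '3 Fir St');
""")

query = """
with household as (
  select
    c1.id as parent_id,
    c2.id as child_id,
    rank() over (partition by c1.id order by c2.id) as child_number
  from customer c1
    left join customer c2 on (c1.phone = c2.phone) or (c1.email = c2.email) or (c1.address = c2.address)
)
select
  'H-' || h.parent_id as household_id,
  h.child_id
from household h
  where h.parent_id < h.child_id or (h.parent_id = h.child_id and h.child_number = 1)
order by h.parent_id, h.child_id  -- added only to make the demo output deterministic
"""
rows = conn.execute(query).fetchall()
for household_id, child_id in rows:
    print(household_id, child_id)
```

This prints H-1 1, H-1 2, H-1 3, and H-2 3. Customer 3 passes the filter twice, once under parent 1 and once under parent 2, because parent_id < child_id holds for both rows. With only two customers the second case cannot occur, which is why pairs look correct.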

I solved this using an approach similar to the one Julius discussed in the comments:

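  1. Use a recursive cte, grp, to group all similar customers together into arrays
  2. Use select distinct on and the cardinality() function in the household cte to whittle the recursive grp output down to only the arrays that contain all household members
  3. Scan the household cte and remove any duplicates by unnesting the linked_customers array and tying all of its elements to a single household_id based on the minimum id in the array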

This seems to work well, but I am sure the query could be simplified further, and I would welcome any feedback!

with recursive grp as (
    select
        c1.id,
        c1.email,
        c1.phone,
        c1.address,
        array[c1.id] as linked_customers -- initializes each array with the customer's own id
    from customer c1
    
    union all
    
    select
        c2.id,
        c2.email,
        c2.phone,
        c2.address,
        c2.id || linked_customers
    from grp g
        join customer c2 on (g.email = c2.email or g.phone = c2.phone or g.address = c2.address)
    where c2.id <> all(linked_customers) -- ensures the same customer id that already was used in the array initialization is not being looked at again
), -- creates several similar groups as well as intermediary groups of customers

household as (
    select
        distinct on (g.id) g.linked_customers
    from grp g
    order by g.id, cardinality(linked_customers) desc
) -- extracts largest array for each customer_id (still duplicate groupings here), but only the max length arrays are being pulled out (all household members)

select
    distinct on (p.parent_id) 'H-' || parent_id as household_id,
    unnest(p.linked_customers) as child_id
from (
    select
        min(parent_id) as parent_id, -- pulls out the minimum id of each linked customers group which will remove the creation of multiple household_ids for the same customer groups in the select clause above
        h.linked_customers
    from household h, unnest(h.linked_customers) parent_id
    group by h.linked_customers
) p
order by parent_id
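One possible simplification, offered as a sketch rather than a verified replacement for the query above: instead of carrying arrays, let the recursive cte collect (root_id, id) pairs meaning "customer id is reachable from customer root_id", and combine the two branches with union instead of union all so that repeated pairs are discarded and the recursion stops by itself. Each customer's household is then the smallest root_id that reaches it. The cte name reach, the column root_id, and the sample rows are all made up for illustration. The demo runs on SQLite through Python's sqlite3 module to stay self-contained, and it has not been run against the original PostgreSQL data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customer (id integer primary key, phone text, email text, address text);
-- hypothetical rows: 1-2 share a phone, 2-3 share an email,
-- 4-5 share an address, and 6 matches nobody
insert into customer values
  (1, '555-0001', 'a@example.com', '1 Elm St'),
  (2, '555-0001', 'b@example.com', '2 Oak St'),
  (3, '555-0003', 'b@example.com', '3 Fir St'),
  (4, '555-0004', 'd@example.com', '9 Ash St'),
  (5, '555-0005', 'e@example.com', '9 Ash St'),
  (6, '555-0006', 'f@example.com', '7 Yew St');
""")

query = """
with recursive reach(root_id, id) as (
    select id, id from customer   -- every customer reaches itself
    union                         -- union (not union all) drops repeated pairs, so the recursion terminates
    select r.root_id, c2.id
    from reach r
        join customer c1 on c1.id = r.id
        join customer c2 on c1.phone = c2.phone or c1.email = c2.email or c1.address = c2.address
)
select
    'H-' || min(root_id) as household_id, -- smallest id in the customer's linked group
    id as child_id
from reach
group by id
order by id
"""
rows = conn.execute(query).fetchall()
for household_id, child_id in rows:
    print(household_id, child_id)
```

On this sample, customers 1, 2, and 3 all land in H-1 even though 1 and 3 share no field directly, customers 4 and 5 land in H-4, and customer 6 gets H-6 on its own. Each customer appears exactly once, so no distinct on or unnest step is needed afterwards. The number of intermediate rows grows with the square of the household size, which may matter if a shared value such as a placeholder address links thousands of customers.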