连通分量

Connected Components

我有一组数据是通过将相似的 sub-items 匹配在一起而创建的,然后通过 "category" 对这些相似的项目进行分组。

现在,生成的类别必须以这样的方式进行匹配,即在每个 "group_id" 中将相关类别组合在一起。在下面的例子中,一个匹配是A->B->C->D->E->F->G,这是通过行递归得到的。

我已经发布了我的 ,它适用于这个简单的数据集,但是因为实际数据集包含多达 100 万行,并且每个 "group_id," 可能有多达 60 个类别此查询导致真实数据出现 "out of spool space" 错误。

请注意:

正确答案会

Sample Input:

Desired Output:

这有效,但会导致实际数据出现 "out of spool space" 问题。

架构创建:

CREATE VOLATILE TABLE match_detail (
    group_id bigint
    , category_1 varchar(255)
    , category_2 varchar(255)
) PRIMARY INDEX (group_id)
    ON COMMIT PRESERVE ROWS;

INSERT INTO match_detail VALUES (1,'A','B');
INSERT INTO match_detail VALUES (1,'A','A');
INSERT INTO match_detail VALUES (1,'B','A');
INSERT INTO match_detail VALUES (1,'B','C');
INSERT INTO match_detail VALUES (1,'B','B');
INSERT INTO match_detail VALUES (1,'C','B');
INSERT INTO match_detail VALUES (1,'C','D');
INSERT INTO match_detail VALUES (1,'C','C');
INSERT INTO match_detail VALUES (1,'D','C');
INSERT INTO match_detail VALUES (1,'D','E');
INSERT INTO match_detail VALUES (1,'D','D');
INSERT INTO match_detail VALUES (1,'E','D');
INSERT INTO match_detail VALUES (1,'E','F');
INSERT INTO match_detail VALUES (1,'E','E');
INSERT INTO match_detail VALUES (1,'F','E');
INSERT INTO match_detail VALUES (1,'F','G');
INSERT INTO match_detail VALUES (1,'F','F');
INSERT INTO match_detail VALUES (1,'G','F');
INSERT INTO match_detail VALUES (1,'G','G');
INSERT INTO match_detail VALUES (1,'W','X');
INSERT INTO match_detail VALUES (1,'W','W');
INSERT INTO match_detail VALUES (1,'W','Y');
INSERT INTO match_detail VALUES (1,'W','Z');
INSERT INTO match_detail VALUES (1,'X','W');
INSERT INTO match_detail VALUES (1,'X','X');
INSERT INTO match_detail VALUES (1,'Y','W');
INSERT INTO match_detail VALUES (1,'Y','Y');
INSERT INTO match_detail VALUES (1,'Z','W');
INSERT INTO match_detail VALUES (1,'Z','Z');
INSERT INTO match_detail VALUES (2,'L','L');
INSERT INTO match_detail VALUES (2,'M','N');
INSERT INTO match_detail VALUES (2,'N','M');
INSERT INTO match_detail VALUES (2,'M','M');
INSERT INTO match_detail VALUES (2,'N','N');

查询:

WITH

 related_cats AS (
    SELECT
        group_id
        , category_1
        , SUM(DISTINCT bitflag) As bitflag_total
        , DENSE_RANK() OVER (
            PARTITION BY
                group_id
            ORDER BY
                group_id
                , bitflag_total
        ) As bitflag_group

    FROM bitflags

    GROUP BY 1, 2
)

, bitflags As (
    SELECT
        DISTINCT
            group_id
            , category_1
            , category_2
            , CAST
            (
                2 ** (DENSE_RANK() OVER (
                    PARTITION BY
                        group_id

                    ORDER BY group_id
                        , category_2) - 1)

                As bigint
            ) As bitflag

    FROM cat_join
    WHERE depth = 1
)

, RECURSIVE cat_join AS (
    SELECT DISTINCT
        c1.group_id
        , c1.category_1
        , c1.category_2
        , CAST
        (
            n.num_categories - 1
            As integer
        ) As max_depth
        , CASE
            WHEN c1.category_1 = c1.category_2 THEN 1
            ELSE max_depth
        END As depth
        , 1 As recursion

    FROM matches c1
    INNER JOIN num_categories n
        ON c1.group_id = n.group_id

    UNION ALL

    SELECT
        r1.group_id
        , r1.category_1
        , r2.category_2
        , r1.max_depth
        , CASE
            WHEN r1.category_1 = r1.category_2 THEN 1
            WHEN r1.category_1 = r2.category_2 THEN 1
            ELSE r1.depth - 1
        END As cur_depth
        , recursion + 1

        FROM cat_join r1
        INNER JOIN matches r2
            ON r1.group_id = r2.group_id
            AND r1.category_2 = r2.category_1

        WHERE
        r1.category_1 <> r2.category_2
        AND r1.category_1 <> r2.category_1
        AND
            (r1.depth - 1) > 0
)

, matches AS (
    SELECT
        d.group_id
        , d.category_1
        , d.category_2

    FROM match_detail d
    GROUP BY d.group_id, d.category_1, d.category_2
)

, num_categories AS (
    SELECT
        u.group_id
        , COUNT(DISTINCT u.category_2) AS num_categories

    FROM categories u
    GROUP BY u.group_id
)

, categories AS (
    SELECT DISTINCT
        u.group_id
        , u.category_1
        , u.category_2

    FROM match_detail u
)

SELECT *
FROM related_cats

您需要一个递归方法,但是您的 WITH RECURSIVE 产生了一个巨大的中间结果,这导致 不再有假脱机

对于类似的过程,我使用了以下方法(最初在存储过程中使用 WHILE 循环):

CREATE MULTISET VOLATILE TABLE vt_tmp, NO Log  AS
 (
  SELECT group_id, category_1, category_2, 
     -- assign a unique number to 
     Dense_Rank() Over (ORDER BY group_id, category_1) AS rnk

  -- remove when you source data is unique
  GROUP BY 1,2,3 -- same result as a DISTINCT, but processed before DENSE_RANK
  FROM match_detail 
 )
WITH DATA
PRIMARY INDEX (category_2)
ON COMMIT PRESERVE ROWS;

现在重复以下更新,直到 0 rows processed

-- find matching categories and assign them a common number    
UPDATE vt_tmp FROM
 ( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk
   FROM vt_tmp e1 JOIN vt_tmp e2
   ON e1.category_2 = e2.category_2
   AND e1.rnk < e2.rnk
   GROUP BY e2.group_id, e2.category_1
 ) x
SET rnk = minrnk
WHERE 
  vt_tmp.group_id = x.group_id
AND vt_tmp.category_1 = x.category_1
;

获取您最终需要的相关类别:

SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk 
FROM vt_tmp

为了与您的预期结果完全匹配,您需要添加 DENSE_RANK:

SELECT group_id, category, Dense_Rank() Over (PARTITION BY group_id ORDER BY related_categories)
FROM
 (
   SELECT group_id, category_1 AS category, rnk AS related_categories
   FROM vt_tmp
   UNION
   SELECT group_id, category_2, rnk 
   FROM vt_tmp
 ) AS dt

此解决方案的概念是在遍历边缘时避免循环。
它是通过在 运行 上计算路径的位图并避免添加其 category_2 已经在位图中的边来完成的。
所有路径都以自引用边开始(例如'A'--'A')以获取第一个节点位图。
该方案将迭代子查询(edges)的结果记录数减少到70条。


该解决方案的局限性在于需要提前定义位图大小。
varbinary 文字中的每个数字代表 4 位。
64 位 = '0000000000000000'xb
128 位 = '00000000000000000000000000000000'xb
等等


with        category_group_bitmap (group_id,category_1,bitflag_group)
            as 
            (
                select      group_id
                           ,category_1

                           ,dense_rank () over 
                            (
                                partition by    group_id 
                                order by        sum (distinct category_2_n)
                            ) as bitflag_group

                from        edges

                group by    group_id
                           ,category_1
            )  

           ,recursive edges (n,group_id,category_1,category_2,categories_num,bitmap,category_2_n)
            as
            (
                select      1                                           as n
                           ,m.group_id
                           ,m.category_1
                           ,m.category_2
                           ,gc.categories_num
                           ,setbit ('0000000000000000'xb,m.category_2_n)  as bitmap
                           ,m.category_2_n

                from                    match_detail_category_2_n   as m

                            join        group_categories_num        as gc

                            on          gc.group_id =
                                        m.group_id

                where       m.category_1    =
                            m.category_2

                union all

                select      e.n + 1                             as n
                           ,e.group_id
                           ,e.category_1
                           ,m.category_2
                           ,e.categories_num
                           ,setbit (e.bitmap,m.category_2_n)    as bitmap
                           ,m.category_2_n

                from                    edges                       as e

                            join        match_detail_category_2_n   as m

                            on          m.group_id  =
                                        e.group_id

                                    and m.category_1    =
                                        e.category_2

                where       e.n <   e.categories_num - 1
                        and getbit (e.bitmap,m.category_2_n) = 0
            )

           ,match_detail_category_2_n (group_id,category_1,category_2,category_2_n)
            as 
            (   
                select      m.group_id
                           ,category_1
                           ,category_2

                           ,cast 
                            (
                                dense_rank () over 
                                (
                                    partition by    group_id 
                                    order by        category_2
                                ) - 1 

                                as byteint
                            )

                from        match_detail as m
            )

           ,group_categories_num (group_id,categories_num)
            as 
            (   
                select      group_id
                           ,count (distinct category_1)

                from        match_detail

                group by    group_id
            )

select      *

from        category_group_bitmap
;