Oracle: consolidate or re-normalize row sets

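We have a table whose values have been expanded into a denormalized set, and I need to re-normalize it by finding the minimum number of reference sets.

A simplified version of the source data looks like this: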

Period  Group  Item  Seq
------  -----  ----  ---
     1      A     1    1
     1      A     2    2
     1      A     3    3
     1      B     1    1
     1      B     2    2
     1      B     3    3
     1      C     1    1
     1      C     4    2
     1      C     5    3
     1      D     2    1
     1      D     1    2
     1      D     3    3
     1      E     1    1
     1      E     2    2
     1      F     2    1
     1      F     1    2
     1      F     3    3

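I want to extract the minimum number of lists defined in the data, and assign each period and group a reference to its list. A list consists of an ordered sequence of items. These are the 4 lists defined in the data above: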

List  Item  Seq
----  ----  ---
   1     2    1
   1     1    2
   1     3    3
   2     1    1
   2     2    2
   2     3    3
   3     1    1
   3     4    2
   3     5    3
   4     1    1
   4     2    2

And this is the output I want to achieve:

Period  Group  List
------  -----  ----
     1      A     2
     1      B     2
     1      C     3
     1      D     1
     1      E     4
     1      F     1

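I have a solution that uses ORA_HASH and LISTAGG to generate a hash of the items in a group, but it fails when a group has more than about 400 items. The error raised is ORA-01489: result of string concatenation is too long.

I am looking for a general solution that works regardless of the number of items in a group in any given period.

Items are identified by integer values less than 100,000. In practice, we will never see more than 4000 items in a group.

This is logically similar to my current approach, which works for up to about 400 item records per group: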
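As a rough check on where that threshold comes from, the query below measures how many characters each item adds to the concatenated string. This is only my reading of the limit: it assumes LISTAGG returns a VARCHAR2 capped at the default 4000 bytes, and the column aliases chars_per_item and max_items are made up for the illustration.

```sql
-- to_char() with a zero-padded numeric format reserves a leading position for the sign,
-- so the '00000' format yields 6 characters and the '0000' format yields 5.
select length(to_char(1, '00000') || to_char(1, '0000'))              as chars_per_item,
       floor(4000 / length(to_char(1, '00000') || to_char(1, '0000'))) as max_items
from   dual;
-- chars_per_item = 11, max_items = 363
```

Under those assumptions the ceiling is a few hundred items per group, which is in line with the failures I see at around 400.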

WITH     
the_source_data as (
    select 1 as the_period, 'A' as the_group, 1 as the_item, 1 as the_seq from dual union
    select 1 as the_period, 'A' as the_group, 2 as the_item, 2 as the_seq from dual union
    select 1 as the_period, 'A' as the_group, 3 as the_item, 3 as the_seq from dual union
    select 1 as the_period, 'B' as the_group, 1 as the_item, 1 as the_seq from dual union
    select 1 as the_period, 'B' as the_group, 2 as the_item, 2 as the_seq from dual union
    select 1 as the_period, 'B' as the_group, 3 as the_item, 3 as the_seq from dual union
    select 1 as the_period, 'C' as the_group, 1 as the_item, 1 as the_seq from dual union
    select 1 as the_period, 'C' as the_group, 4 as the_item, 2 as the_seq from dual union
    select 1 as the_period, 'C' as the_group, 5 as the_item, 3 as the_seq from dual union
    select 1 as the_period, 'D' as the_group, 2 as the_item, 1 as the_seq from dual union
    select 1 as the_period, 'D' as the_group, 1 as the_item, 2 as the_seq from dual union
    select 1 as the_period, 'D' as the_group, 3 as the_item, 3 as the_seq from dual union
    select 1 as the_period, 'E' as the_group, 1 as the_item, 1 as the_seq from dual union
    select 1 as the_period, 'E' as the_group, 2 as the_item, 2 as the_seq from dual union
    select 1 as the_period, 'F' as the_group, 2 as the_item, 1 as the_seq from dual union
    select 1 as the_period, 'F' as the_group, 1 as the_item, 2 as the_seq from dual union
    select 1 as the_period, 'F' as the_group, 3 as the_item, 3 as the_seq from dual        
),    
cte_list_hash as (
select
    the_period,
    the_group, 
    ora_hash(listagg(to_char(the_item, '00000')||to_char(the_seq, '0000')) within group (order by the_seq)) as list_hash
from
    the_source_data
group by
    the_period,
    the_group 
),
cte_unique_lists as
(
select
    list_hash,
    min(the_period) keep (dense_rank first order by the_period, the_group) as the_period,
    min(the_group) keep (dense_rank first order by the_period, the_group) as the_group
from
    cte_list_hash
group by 
    list_hash
),
cte_list_base as
(
select    
    the_period,
    the_group,
    list_hash,
    rownum as the_list        
from
    cte_unique_lists
)
select
    A.the_period,
    A.the_group,
    B.the_list
from
    cte_list_hash A
    inner join
    cte_list_base B
        on A.list_hash = B.list_hash;

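Any help pointing me in the right direction would be greatly appreciated.

Here is a way to get the result without using LISTAGG and without hitting the ORA-01489 error.

The main caveat is that it numbers the lists differently than your example does, but that numbering seems arbitrary to me. This version numbers the lists by the ordinal position of the first period/group that uses each list. For example, the list used by period 1, group A is "list #1".

I added some sample data for period 2 to make sure that case is handled correctly too.

Hopefully the comments in the SQL below explain the approach clearly enough.

Finally, I don't know how long this will take to run on a large data set. The cross join may be a problem.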
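To show that numbering rule in isolation, here is a minimal sketch. The inline rows and the names first_use, first_period, first_group and list_no are made up for the illustration; each row stands for one period/group and carries the first period/group that uses the same list.

```sql
-- DENSE_RANK gives tied rows the same number and leaves no gaps between numbers,
-- so every period/group that shares a "first user" gets the same list number.
with first_use as (
    select 1 as first_period, 'A' as first_group from dual union all  -- e.g. group A
    select 1, 'A' from dual union all                                 -- e.g. group B, same list as A
    select 1, 'C' from dual union all
    select 1, 'D' from dual
)
select first_period,
       first_group,
       dense_rank() over (order by first_period, first_group) as list_no
from   first_use
order by first_period, first_group;
-- list_no comes out as 1, 1, 2, 3
```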

WITH     
the_source_data as (
    select 1 as the_period, 'A' as the_group, 1 as the_item, 1 as the_seq from dual union
    select 1 as the_period, 'A' as the_group, 2 as the_item, 2 as the_seq from dual union
    select 1 as the_period, 'A' as the_group, 3 as the_item, 3 as the_seq from dual union
    select 1 as the_period, 'B' as the_group, 1 as the_item, 1 as the_seq from dual union
    select 1 as the_period, 'B' as the_group, 2 as the_item, 2 as the_seq from dual union
    select 1 as the_period, 'B' as the_group, 3 as the_item, 3 as the_seq from dual union
    select 1 as the_period, 'C' as the_group, 1 as the_item, 1 as the_seq from dual union
    select 1 as the_period, 'C' as the_group, 4 as the_item, 2 as the_seq from dual union
    select 1 as the_period, 'C' as the_group, 5 as the_item, 3 as the_seq from dual union
    select 1 as the_period, 'D' as the_group, 2 as the_item, 1 as the_seq from dual union
    select 1 as the_period, 'D' as the_group, 1 as the_item, 2 as the_seq from dual union
    select 1 as the_period, 'D' as the_group, 3 as the_item, 3 as the_seq from dual union
    select 1 as the_period, 'E' as the_group, 1 as the_item, 1 as the_seq from dual union
    select 1 as the_period, 'E' as the_group, 2 as the_item, 2 as the_seq from dual union
    select 1 as the_period, 'F' as the_group, 2 as the_item, 1 as the_seq from dual union
    select 1 as the_period, 'F' as the_group, 1 as the_item, 2 as the_seq from dual union
    select 1 as the_period, 'F' as the_group, 3 as the_item, 3 as the_seq from dual union
    select 2 as the_period, 'F' as the_group, 1 as the_item, 1 as the_seq from dual union
    select 2 as the_period, 'F' as the_group, 4 as the_item, 2 as the_seq from dual union
    select 2 as the_period, 'F' as the_group, 5 as the_item, 3 as the_seq from dual
),
-- Count the rows in each period/group. We need this to avoid matching a long list to a shorter list
-- that happens to share the same items in the same order, as far as it goes.
sd2 as (
    select sd.*,
           count(*) over (partition by sd.the_period, sd.the_group) cnt
    from   the_source_data sd
),
-- Join every row to every other row, keeping only pairs that match on item, seq, and list length.
-- Then count the matches for each pair of period/groups (cnt3).
sd3 as (
    select sd2a.the_period, sd2a.the_group, sd2a.the_item, sd2a.the_seq, sd2a.cnt,
           sd2b.the_period the_period2, sd2b.the_group the_group2,
           sd2b.the_item the_item2, sd2b.the_seq the_seq2, sd2b.cnt cnt2,
           count(*) over (partition by sd2a.the_period, sd2a.the_group,
                                       sd2b.the_period, sd2b.the_group) cnt3
    from   sd2 sd2a cross join sd2 sd2b
    where  sd2b.the_item = sd2a.the_item
    and    sd2b.the_seq  = sd2a.the_seq
    and    sd2a.cnt      = sd2b.cnt
),
-- Keep only the period/group pairs with as many matches as the original period/group has elements,
-- i.e. perfect list matches: same elements, same order, same list length.
-- For each period/group, get the first period and group that share its list.
sd4 as (
    select the_period, the_group,
           min(the_period2) keep (dense_rank first order by the_period2, the_group2)
               over (partition by the_period, the_group) first_period,
           min(the_group2) keep (dense_rank first order by the_period2, the_group2)
               over (partition by the_period, the_group) first_group
    from   sd3
    where  cnt = cnt3
)
-- Name the lists, arbitrarily, by the ordinal position of the first period and group that uses each list.
select distinct the_period, the_group,
       dense_rank() over (partition by null order by first_period, first_group) list
from   sd4
order by 1, 2
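If the self-join turns out to be too slow, one possible alternative is to fingerprint each list with numeric aggregates instead of a concatenated string. The following is only a sketch that I have not benchmarked, and it is not an exact method: like the ORA_HASH approach in the question, it treats two lists as equal when their fingerprints match, so a hash collision would merge two different lists. The names src, list_fp, first_use, item_cnt and fp are made up for the illustration, and src is a reduced copy of the sample data.

```sql
with src as (
    select 1 as the_period, 'A' as the_group, 1 as the_item, 1 as the_seq from dual union all
    select 1, 'A', 2, 2 from dual union all
    select 1, 'A', 3, 3 from dual union all
    select 1, 'B', 1, 1 from dual union all
    select 1, 'B', 2, 2 from dual union all
    select 1, 'B', 3, 3 from dual union all
    select 1, 'D', 2, 1 from dual union all
    select 1, 'D', 1, 2 from dual union all
    select 1, 'D', 3, 3 from dual union all
    select 1, 'E', 1, 1 from dual union all
    select 1, 'E', 2, 2 from dual
),
-- One row per period/group: the list length plus the sum of a hash of every (seq, item) pair.
-- Each hashed string is only a few characters long, so ORA-01489 cannot occur here.
list_fp as (
    select the_period,
           the_group,
           count(*) as item_cnt,
           sum(ora_hash(to_char(the_seq) || ':' || to_char(the_item))) as fp
    from   src
    group by the_period, the_group
),
-- For each fingerprint, find the first period/group that uses it.
first_use as (
    select lf.*,
           min(the_period) keep (dense_rank first order by the_period, the_group)
               over (partition by item_cnt, fp) as first_period,
           min(the_group) keep (dense_rank first order by the_period, the_group)
               over (partition by item_cnt, fp) as first_group
    from   list_fp lf
)
-- Number the lists the same way as the query above: by the first period/group that uses them.
select the_period,
       the_group,
       dense_rank() over (order by first_period, first_group) as the_list
from   first_use
order by the_period, the_group;
```

Barring a collision, groups A and B should share one list number while D and E each get their own. Because the position is hashed together with the item, reordering the same items changes the fingerprint, which is what keeps D apart from A and B. If exact results matter, the fingerprint could be used only to narrow down candidate pairs, with the row-by-row comparison from the query above run on those candidates.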