Oracle consolidate or re-normalize row sets
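We have a table whose values have been expanded into a denormalized set, and I need to re-normalize it by finding the minimum number of reference sets.
A simplified version of the source data looks like this: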
Period Group Item Seq
------ ----- ---- ---
1      A     1    1
1      A     2    2
1      A     3    3
1      B     1    1
1      B     2    2
1      B     3    3
1      C     1    1
1      C     4    2
1      C     5    3
1      D     2    1
1      D     1    2
1      D     3    3
1      E     1    1
1      E     2    2
1      F     2    1
1      F     1    2
1      F     3    3
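I want to extract the minimum number of lists defined in the data and assign a list reference to each period and group. A list consists of an ordered sequence of items. Here are the 4 lists defined in the data above: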
List Item Seq
---- ---- ---
1    2    1
1    1    2
1    3    3
2    1    1
2    2    2
2    3    3
3    1    1
3    4    2
3    5    3
4    1    1
4    2    2
And the output I want to achieve:
Period Group List
------ ----- ----
1      A     2
1      B     2
1      C     3
1      D     1
1      E     4
1      F     1
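I have a solution that uses ORA_HASH and LISTAGG to generate a hash of the items in a group, but it fails when a group contains more than about 400 items. The error raised is ORA-01489: result of string concatenation is too long.
I am looking for a general solution that works regardless of how many items a group contains in any given period.
Items are identified by integer values less than 100,000.
In practice, we will never see more than 4000 items in a group.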
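As a side note on why the limit lands near 400: assuming LISTAGG results are capped at 4000 characters here (that is, MAX_STRING_SIZE is not set to EXTENDED and the character set is single-byte, both of which are assumptions about this database), each item adds a fixed-width chunk to the concatenated string. The sketch below measures that width using the same TO_CHAR masks as the query that follows. A mask such as '00000' reserves a leading position for the sign, so the chunk is wider than its digit count suggests.

```sql
-- Width of the string that one (item, seq) pair adds to the LISTAGG result,
-- and how many such pairs fit into 4000 characters (the assumed limit).
select length(to_char(1, '00000') || to_char(1, '0000'))               as chars_per_item,
       floor(4000 / length(to_char(1, '00000') || to_char(1, '0000'))) as max_items
from dual;
```

With one sign position per mask, each item takes 11 characters, so about 363 items fit under that assumed limit, which is broadly in line with the failures seen at around 400 items.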
This is logically similar to the approach that works for up to about 400 item records per group:
WITH
the_source_data as (
select 1 as the_period, 'A' as the_group, 1 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'A' as the_group, 2 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'A' as the_group, 3 as the_item, 3 as the_seq from dual union
select 1 as the_period, 'B' as the_group, 1 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'B' as the_group, 2 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'B' as the_group, 3 as the_item, 3 as the_seq from dual union
select 1 as the_period, 'C' as the_group, 1 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'C' as the_group, 4 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'C' as the_group, 5 as the_item, 3 as the_seq from dual union
select 1 as the_period, 'D' as the_group, 2 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'D' as the_group, 1 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'D' as the_group, 3 as the_item, 3 as the_seq from dual union
select 1 as the_period, 'E' as the_group, 1 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'E' as the_group, 2 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'F' as the_group, 2 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'F' as the_group, 1 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'F' as the_group, 3 as the_item, 3 as the_seq from dual
),
cte_list_hash as (
  select
    the_period,
    the_group,
    ora_hash(listagg(to_char(the_item, '00000')||to_char(the_seq, '0000')) within group (order by the_seq)) as list_hash
  from
    the_source_data
  group by
    the_period,
    the_group
),
cte_unique_lists as
(
  select
    list_hash,
    min(the_period) keep (dense_rank first order by the_period, the_group) as the_period,
    min(the_group) keep (dense_rank first order by the_period, the_group) as the_group
  from
    cte_list_hash
  group by
    list_hash
),
cte_list_base as
(
  select
    the_period,
    the_group,
    list_hash,
    rownum as the_list
  from
    cte_unique_lists
)
select
  A.the_period,
  A.the_group,
  B.the_list
from
  cte_list_hash A
  inner join
  cte_list_base B
  on A.list_hash = B.list_hash;
Any help pointing me in the right direction would be appreciated.
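Here is a way to get the result without using LISTAGG and without hitting the ORA-01489 error.
The main caveat is that it numbers the lists differently from your example, but that numbering seemed arbitrary to me. This version numbers the lists by the ordinal position of the first period/group that uses them. That is, for example, the list used by period 1, group A is "list #1".
I added some sample data for period 2 to make sure that case is handled correctly as well.
Hopefully the comments in the SQL below explain the approach clearly enough.
Finally, I don't know how long this would take to run on a large data set. The cross join may be a problem.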
WITH
the_source_data as (
select 1 as the_period, 'A' as the_group, 1 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'A' as the_group, 2 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'A' as the_group, 3 as the_item, 3 as the_seq from dual union
select 1 as the_period, 'B' as the_group, 1 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'B' as the_group, 2 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'B' as the_group, 3 as the_item, 3 as the_seq from dual union
select 1 as the_period, 'C' as the_group, 1 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'C' as the_group, 4 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'C' as the_group, 5 as the_item, 3 as the_seq from dual union
select 1 as the_period, 'D' as the_group, 2 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'D' as the_group, 1 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'D' as the_group, 3 as the_item, 3 as the_seq from dual union
select 1 as the_period, 'E' as the_group, 1 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'E' as the_group, 2 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'F' as the_group, 2 as the_item, 1 as the_seq from dual union
select 1 as the_period, 'F' as the_group, 1 as the_item, 2 as the_seq from dual union
select 1 as the_period, 'F' as the_group, 3 as the_item, 3 as the_seq from dual union
select 2 as the_period, 'F' as the_group, 1 as the_item, 1 as the_seq from dual union
select 2 as the_period, 'F' as the_group, 4 as the_item, 2 as the_seq from dual union
select 2 as the_period, 'F' as the_group, 5 as the_item, 3 as the_seq from dual
),
-- This CTE counts the number of rows in each period and group. We need this to avoid matching a long list to a shorter list
-- that happens to share the same order, as far as it goes.
sd2 as (
  select sd.*, count(*) over ( partition by sd.the_period, sd.the_group ) cnt
  from the_source_data sd ),
-- This CTE joins every row to every other row and then keeps only pairs that match on item#, seq, and list length.
-- It then counts the number of matching rows for each pair of period/groups (cnt3).
sd3 as (
  select sd2a.the_period, sd2a.the_group, sd2a.the_item, sd2a.the_seq, sd2a.cnt,
         sd2b.the_period the_period2, sd2b.the_group the_group2, sd2b.the_item the_item2, sd2b.the_seq the_seq2, sd2b.cnt cnt2,
         count(*) over ( partition by sd2a.the_period, sd2a.the_group, sd2b.the_period, sd2b.the_group ) cnt3
  from sd2 sd2a cross join sd2 sd2b
  where sd2b.the_item = sd2a.the_item
  and sd2b.the_seq = sd2a.the_seq
  and sd2a.cnt = sd2b.cnt ),
-- This CTE keeps only the period/group pairs that had as many matches as there are elements in the original period/group.
-- I.e., it filters to perfect list matches: all elements the same, in the same order, and the list lengths are the same.
-- For each period/group, it gets the first period and group that share the list.
sd4 as (
  select the_period, the_group,
         min(the_period2) keep ( DENSE_RANK FIRST ORDER BY the_period2, the_group2 ) OVER ( partition by the_period, the_group ) first_period,
         min(the_group2) keep ( DENSE_RANK FIRST ORDER BY the_period2, the_group2 ) OVER ( partition by the_period, the_group ) first_group
  from sd3 where cnt = cnt3 )
-- We'll arbitrarily name the lists based on the ordinal position of the first period and group that uses the list.
select distinct the_period, the_group, dense_rank() over ( partition by null order by first_period, first_group ) list
from sd4
order by 1,2
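If the cross join above turns out to be too slow, one possible alternative is to fingerprint each list with a single GROUP BY instead of comparing rows pairwise. The sketch below is not the method above; it swaps exact matching for a probabilistic fingerprint made of the row count plus two sums of per-row ORA_HASH values (the second uses a different seed). The CTE names list_fp and first_use and the columns item_cnt, hash_sum_a and hash_sum_b are hypothetical names introduced for illustration. The encoding the_seq * 100000 + the_item relies on the stated fact that items are below 100,000, so each (seq, item) pair maps to a distinct number. Because hash sums can collide, two different lists could in principle receive the same fingerprint, so matches may need an exact check if that risk matters.

```sql
WITH
the_source_data as (
  -- same period 1 sample rows as in the question; replace with the real table
  select 1 as the_period, 'A' as the_group, 1 as the_item, 1 as the_seq from dual union all
  select 1, 'A', 2, 2 from dual union all
  select 1, 'A', 3, 3 from dual union all
  select 1, 'B', 1, 1 from dual union all
  select 1, 'B', 2, 2 from dual union all
  select 1, 'B', 3, 3 from dual union all
  select 1, 'C', 1, 1 from dual union all
  select 1, 'C', 4, 2 from dual union all
  select 1, 'C', 5, 3 from dual union all
  select 1, 'D', 2, 1 from dual union all
  select 1, 'D', 1, 2 from dual union all
  select 1, 'D', 3, 3 from dual union all
  select 1, 'E', 1, 1 from dual union all
  select 1, 'E', 2, 2 from dual union all
  select 1, 'F', 2, 1 from dual union all
  select 1, 'F', 1, 2 from dual union all
  select 1, 'F', 3, 3 from dual
),
-- One fingerprint per period/group: the list length plus two order-sensitive hash sums.
-- Each row is encoded as a single number so that position and item are hashed together.
list_fp as (
  select the_period,
         the_group,
         count(*)                                                  as item_cnt,
         sum(ora_hash(the_seq * 100000 + the_item))                as hash_sum_a,
         sum(ora_hash(the_seq * 100000 + the_item, 4294967295, 1)) as hash_sum_b
  from the_source_data
  group by the_period, the_group
),
-- For each period/group, find the first period/group that has the same fingerprint.
first_use as (
  select the_period,
         the_group,
         min(the_period) keep (dense_rank first order by the_period, the_group)
           over (partition by item_cnt, hash_sum_a, hash_sum_b) as first_period,
         min(the_group) keep (dense_rank first order by the_period, the_group)
           over (partition by item_cnt, hash_sum_a, hash_sum_b) as first_group
  from list_fp
)
-- Number the lists by the first period/group that uses them, as in the query above.
select the_period,
       the_group,
       dense_rank() over (order by first_period, first_group) as the_list
from first_use
order by the_period, the_group;
```

Barring a hash collision, this should put groups A and B on one list and groups D and F on another, with C and E each getting their own. Nothing is concatenated, so the length of a group does not affect whether the query runs.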