Selecting missing values from a comparison of 3 tables (or more)

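I have 3 tables with the same columns, coming from different sources. To begin with, I have columns A and B, whose combination should be unique. I want to compare columns A and B across the three tables and, where values are missing, select the (A, B) value pair together with the table it is missing from. Ideally the missing values would also be counted.

The end result should be a listagg containing the table, the column A value, and the count of missing column B values.

Example, with specific column names:

Column A = Region, column B = Customer_ID

Then we have 3 tables: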

Table 1 : Table1.Region | Table1.Customer_ID
Table 2 : Table2.Region | Table2.Customer_ID
Table 3 : Table3.Region | Table3.Customer_ID

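In the case above, for region "001", Table 1 is missing 6 values that exist in Table 2 and Table 3.

In addition, Table 2 is missing 2 values for region "002".

The desired result should be a listagg, like this: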

result: ("Table 1", 001, 6; "Table 2", 002, 2;)

If I understood correctly, you need a query like this, using the LISTAGG() function with OUTER JOINs between the tables:

SELECT LISTAGG( NVL2(t2.Customer_ID,'"Table2"','"Table1"')||','||t3.Region||','
                   ||t3.Customer_ID, ';' )      
       WITHIN GROUP (ORDER BY t3.Customer_ID) AS "Result"
  FROM t3
  LEFT JOIN t2 ON t2.Customer_ID = t3.Customer_ID
  LEFT JOIN t1 ON t1.Customer_ID = t3.Customer_ID

Demo

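  • I union all three tables, adding a label column "tab"
  • Then, in inline view "t", I use having count(tab) != 3 to keep only the rows that are not present in all three tables; I then use logic on the sum(tab) result to tell the source tables apart
  • Then, in inline view "tt", I use the count analytic function to count the rows by REGION and tX_missing
  • Then, in inline view "ttt", I group the rows by region and prepare the output format (per column)
  • Finally, I use listagg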
with compare_tab as (
select Region, Customer_ID, 1 tab from t1 union all
select Region, Customer_ID, 2 tab from t2 union all
select Region, Customer_ID, 4 tab from t3
)
select listagg(merge_col, chr(10)) within group (order by merge_col)
from (
  select tt.region
     , '"Table 1", '||tt.region||', '||max(count_t1_missing)
    || ', "Table 2", '||tt.region||', '||max(count_t2_missing)
    || ', "Table 3", '||tt.region||', '||max(count_t3_missing) merge_col
  from ( 
    select region, Customer_ID
    , count(t1_missing)over(partition by REGION, t1_missing) count_t1_missing
    , count(t2_missing)over(partition by REGION, t2_missing) count_t2_missing
    , count(t3_missing)over(partition by REGION, t3_missing) count_t3_missing
    from (
      select Region, Customer_ID--, count(tab) cnt, sum(tab)s
      , case when sum(tab) in (2, 4, 6) then 'Table1' end t1_missing
      , case when sum(tab) in (1, 4, 5) then 'Table2' end t2_missing
      , case when sum(tab) in (1, 2, 3) then 'Table3' end t3_missing
      from compare_tab
      group by Region, Customer_ID
      having count(tab) != 3
      order by 1, 2, 3, 4
    ) t
  )tt
  group by tt.region
)ttt
;
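
The bit-flag logic behind sum(tab) can be checked outside the database. The Python sketch below is only an illustration of the idea, not part of the query: the helper name `missing_tables` and the sample rows are hypothetical. Each table contributes a distinct power of two (1, 2, 4), so the sum per (Region, Customer_ID) pair identifies which tables contain the pair, assuming the pair occurs at most once per table.

```python
from collections import defaultdict

# Flag values mirror the "tab" column of the query: t1 -> 1, t2 -> 2, t3 -> 4.
FLAGS = {"Table1": 1, "Table2": 2, "Table3": 4}

def missing_tables(flag_sum):
    """Return the tables whose flag is absent from flag_sum (hypothetical helper)."""
    return [name for name, flag in FLAGS.items() if not flag_sum & flag]

# Hypothetical rows (region, customer_id, tab), as the UNION ALL would produce them.
rows = [
    ("001", 1, 2), ("001", 1, 4),                 # pair missing from t1
    ("002", 7, 1), ("002", 7, 4),                 # pair missing from t2
    ("003", 9, 1), ("003", 9, 2), ("003", 9, 4),  # pair present everywhere
]

sums = defaultdict(int)
for region, customer_id, tab in rows:
    sums[(region, customer_id)] += tab  # sum(tab) ... group by Region, Customer_ID

# With unique pairs per table, a sum of 7 means "present in all three tables",
# which is what "having count(tab) != 3" filters out.
result = {pair: missing_tables(s) for pair, s in sums.items() if s != 7}
print(result)
```

The sums 2, 4 and 6 all lack the flag 1, which is why the query maps them to "Table1" in t1_missing; the same reasoning gives (1, 4, 5) for Table2 and (1, 2, 3) for Table3.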

Here is my sample data:

create table t1 (Region varchar2(50), Customer_ID number(4));
create table t2 (Region varchar2(50), Customer_ID number(4));
create table t3 (Region varchar2(50), Customer_ID number(4));

insert all
when mod(customer, 3) = 0  then INTO t3 (Region, Customer_ID) values (region, customer)
when mod(customer, 2) = 0  then INTO t2 (Region, Customer_ID) values (region, customer)
when mod(customer, 5) = 0 then INTO t1 (Region, Customer_ID) values (region, customer)
select lpad(case when mod(level, 5) = 0 then 5 else mod(level, 5) end, 3, '0') region, level customer
from dual
connect by level <= 25
order by 1
;
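
To see which rows the sample data puts where, the INSERT ALL statement can be mirrored in a few lines of Python. This is a rough sketch for illustration only; the set names simply reuse the table names. It relies on the fact that a conditional INSERT ALL evaluates every WHEN clause for every row, so one customer can end up in several tables.

```python
# Mirror of the INSERT ALL statement: each WHEN clause is checked independently.
t1, t2, t3 = set(), set(), set()

for level in range(1, 26):                 # connect by level <= 25
    region = str(level % 5 or 5).zfill(3)  # lpad(..., 3, '0') gives '001'..'005'
    if level % 3 == 0:
        t3.add((region, level))
    if level % 2 == 0:
        t2.add((region, level))
    if level % 5 == 0:
        t1.add((region, level))

print(len(t1), len(t2), len(t3))
print(sorted({region for region, _ in t1}))
```

Because t1 only receives multiples of 5, all of its rows fall in region '005', so every pair in the other regions counts as missing from t1.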

The following gets the distribution of values for each region:

select region, in_1, in_2, in_3, count(*)
from (select region, customer_id, max(in_1) as in_1, max(in_2) as in_2, max(in_3) as in_3
      from ((select region, customer_id, 1 as in_1, 0 as in_2, 0 as in_3
             from table1
            ) union all
            (select region, customer_id, 0 as in_1, 1 as in_2, 0 as in_3
             from table2
            ) union all
            (select region, customer_id, 0 as in_1, 0 as in_2, 1 as in_3
             from table3
            ) 
           ) t
      group by region, customer_id
     ) rc
group by region, in_1, in_2, in_3
order by region, count(*) desc;

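I'm not 100% clear on how to convert this into your desired format. But I think it would be:

select region,
       -- the subtractions are parenthesized because || and - have the same precedence in Oracle
       ( 'Table1: ' || (count(*) - sum(in_1)) || ';' ||
         'Table2: ' || (count(*) - sum(in_2)) || ';' ||
         'Table3: ' || (count(*) - sum(in_3))
       ) as summary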
from (select region, customer_id, max(in_1) as in_1, max(in_2) as in_2, max(in_3) as in_3
      from ((select region, customer_id, 1 as in_1, 0 as in_2, 0 as in_3
             from table1
            ) union all
            (select region, customer_id, 0 as in_1, 1 as in_2, 0 as in_3
             from table2
            ) union all
            (select region, customer_id, 0 as in_1, 0 as in_2, 1 as in_3
             from table3
            ) 
           ) t
      group by region, customer_id
     ) rc
group by region
order by region;

However, I think the first format is more informative.
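
The presence-flag idea behind both queries can be sketched in Python as follows. The table contents here are hypothetical and only serve to show the arithmetic: per region, the number of distinct pairs minus the number of pairs present in a table is the number of pairs missing from that table, which is what count(*) - sum(in_x) computes.

```python
from collections import defaultdict

# Hypothetical contents of the three tables as (region, customer_id) pairs.
table1 = {("001", 1), ("002", 5)}
table2 = {("001", 1), ("001", 2)}
table3 = {("001", 1), ("001", 2), ("002", 5)}

# One presence flag per table for each distinct pair (the max(in_x) step).
all_pairs = table1 | table2 | table3
flags = {p: (p in table1, p in table2, p in table3) for p in all_pairs}

# Per region, count the pairs that are absent from each table.
summary = defaultdict(lambda: [0, 0, 0])
for (region, _), present in flags.items():
    for i, is_present in enumerate(present):
        summary[region][i] += not is_present

for region in sorted(summary):
    m1, m2, m3 = summary[region]
    print(f"{region}: Table1: {m1};Table2: {m2};Table3: {m3}")
```

With these rows, region "001" has one pair missing from Table1 and region "002" has one pair missing from Table2.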