Find the top matches comparing table columns

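I have a database with up to 400 tables from different sources. I need to group those tables in an Excel file by column similarity (a table may share 0, 1, 2, or all of its column names with another table). The challenge looks like this: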

fac.table_1 has columns C1, C2, C3, C4, and C5
dim.table_2 has columns C1, C3, and C5
stg.table_3 has columns C1, C6, and C7
stg.table_4 has columns C2 and C99

...

Comparing against fac.table_1, the expected result should be:

sch_name | table_name | ncols | nmatches
  dim    |   table_2  |   3   |   3
  stg    |   table_3  |   3   |   1
  stg    |   table_4  |   2   |   1
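
For anyone who wants to reproduce this, the four example tables could be set up roughly as follows. This is only a sketch: the `int` column type is an assumption, since the original gives column names but no data types.

```sql
-- Hypothetical setup for the four example tables; the int type is an assumption.
-- CREATE SCHEMA has to be the first statement in its batch, hence the EXEC wrapper.
exec('create schema fac');
exec('create schema dim');
exec('create schema stg');

create table fac.table_1 (C1 int, C2 int, C3 int, C4 int, C5 int);
create table dim.table_2 (C1 int, C3 int, C5 int);
create table stg.table_3 (C1 int, C6 int, C7 int);
create table stg.table_4 (C2 int, C99 int);
```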

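I think the approach is to use something like the code below together with COUNT or INTERSECT, putting the name of the table I want to compare against the others in the WHERE clause: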

    SELECT
       schemas.name sch_nm,
       tables.name tb_nm,
       columns.name col_nm
    FROM sys.tables
       LEFT JOIN sys.columns ON tables.object_id = columns.object_id
       LEFT JOIN sys.schemas ON tables.schema_id = schemas.schema_id

Do you want to count how many of a table's column names also exist in another table?

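    -- For each table, count its columns and how many of them share a name
    -- with a column of at least one other table in the list.
    select sch_name, tbl_name,
           ncols      = count(*),
           nmatches   = sum(case when col_cnt > 1 then 1 else 0 end),
           -- integer division, so the percentage is truncated
           percentage = sum(case when col_cnt > 1 then 1 else 0 end) * 100 / count(*)
    from
    (
        select sch_name = s.name,
               tbl_name = t.name,
               col_name = c.name,
               -- number of listed tables that have a column with this name
               col_cnt  = count(c.name) over(partition by c.name)
        from   sys.schemas s
               inner join sys.tables t  on s.schema_id = t.schema_id
               inner join sys.columns c on t.object_id = c.object_id
        -- table names as used in the question
        where  t.name in ('table_1', 'table_2', 'table_3', 'table_4')
    ) c
    -- leave out the table the others are compared against
    where tbl_name not in ('table_1')
    group by sch_name, tbl_name
    order by c.tbl_name;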

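Result (table_1 is listed here, which only happens when the outer `where tbl_name not in (...)` filter is removed; the percentage column is not shown):

sch_name | tbl_name | ncols | nmatches
  fac    | table_1  |   5   |   4
  dim    | table_2  |   3   |   3
  stg    | table_3  |   3   |   1
  stg    | table_4  |   2   |   1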
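
Note that `col_cnt > 1` treats a column as matched when its name appears in any of the listed tables, not only in the one you are comparing against. If you want matches strictly against a single reference table, a variant along these lines might work. It is a sketch, not a tested solution: the variables `@ref_schema` and `@ref_table` and the CTE name `ref_cols` are made up for illustration, and it scans every user table in the database rather than a fixed list.

```sql
-- Assumed reference table; change these two values as needed.
declare @ref_schema sysname = N'fac',
        @ref_table  sysname = N'table_1';

with ref_cols as (
    -- column names of the reference table (unique within one table)
    select c.name
    from   sys.schemas s
           inner join sys.tables  t on s.schema_id = t.schema_id
           inner join sys.columns c on t.object_id = c.object_id
    where  s.name = @ref_schema
    and    t.name = @ref_table
)
select sch_name = s.name,
       tbl_name = t.name,
       ncols    = count(*),
       -- count() skips NULLs, so only columns found in ref_cols are counted
       nmatches = count(r.name)
from   sys.schemas s
       inner join sys.tables  t on s.schema_id = t.schema_id
       inner join sys.columns c on t.object_id = c.object_id
       left  join ref_cols    r on r.name = c.name
-- leave out the reference table itself
where  not (s.name = @ref_schema and t.name = @ref_table)
group by s.name, t.name
order by nmatches desc, sch_name, tbl_name;
```

In a database containing only the four example tables, this should give the three rows from the question's expected result, with the best match first. Tables that share no column name with the reference table still appear, with `nmatches = 0`.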

db<>fiddle demo