Delete repeated occurrences of a value across two columns

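I have a dimension that stores salary awards by code and description. Award_Code and Award_Desc together form a natural key. There should be only one description per code and only one code per description, but over the years people have added the same award code with a different description, or the same description with a different award code, leaving the table looking like the example below. In this example, one award code appears twice with different descriptions (Award_SK 6 and Award_SK 2270), and one award description appears twice with different codes (Award_SK 6 and Award_SK 2209). Only Award_SK 6 is the correct Award_Code/Award_Desc combination, and I need to remove the others from the dimension.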

Award_SK Award_Code Award_Desc
6 AWDTEA Teachers Award
2209 TEAAWD Teachers Award
2270 AWDTEA Award for Teachers

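To find out which descriptions and codes in the table above are related to each other, I ran the following code to get the rows that join more than once on Award_Code or Award_Desc.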

--get the list of awards that are associated either by code or description, and put them in a temporary table
    SELECT * INTO #DuplicatedAwards
    FROM
    (
        SELECT Dim_Award_SK,AWARD_CODE, AWARD_DESC
        FROM
        (
            --Type 1: different Award codes, same award description
            SELECT  Dim_Award_SK, award_code,AWD.Award_Desc FROM    
            DM.DIM_AWARD AWD    
            INNER JOIN  
                (SELECT Award_Desc, COUNT(Dim_Award_SK) as total_of_Same_Description_different_code FROM DM.DIM_AWARD
                GROUP BY Award_Desc, Award_Class_Desc
                HAVING count(Award_Desc)>1 
                ) A ON AWD.Award_Desc=A.Award_Desc 
            
            UNION ALL
    
            --Type 2: different award description, same award code
            SELECT  Dim_Award_SK, A.Award_Code,AWD.Award_Desc FROM
            DM.DIM_AWARD AWD
            INNER JOIN
                (SELECT Award_Code,COUNT(Dim_Award_SK) as Total_of_Same_Code_Different_Description FROM DM.DIM_AWARD
                GROUP BY Award_Code
                HAVING count(DISTINCT Award_Desc)>1 
            ) A ON AWD.Award_Code=A.Award_Code 
        )B
    )C
    
    --Join the temporary table to the dimension on award code OR award description. This will show an Award_SK in the first column
    --and its matched Award_SKs in the second column.
    --When a new SK starts in the first column we are looking at a new group of matched awards
    
    SELECT DISTINCT
    AW.Dim_Award_SK as Award_SK,
    DIM.Dim_Award_SK as Matching_Award_SK 
    FROM #DuplicatedAwards AW
    INNER JOIN DM.DIM_AWARD DIM
    ON DIM.Award_Code=AW.Award_Code OR DIM.Award_Desc=AW.Award_Desc
    --exclude rows where the affected SK is matched with itself
    WHERE DIM.Dim_Award_SK <> AW.Dim_Award_SK
    ORDER BY  AW.Dim_Award_SK, DIM.Dim_Award_SK
    
    DROP TABLE #DuplicatedAwards

This gives me a result like this:

Award_SK Matched Award_SK
6 2209
6 2270
8 1853
8 2278
17 2052
17 2442
22 1895
22 2282
22 2428
1853 8
1853 2278
1895 22
1895 2282
1895 2428
2052 17
2052 2442
2209 6
2209 2270
2270 6
2270 2209
2278 8
2278 1853
2282 22
2282 1895
2282 2428
2428 22
2428 1895
2428 2282
2442 17
2442 2052

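The first two values in the left column are the same, so I know I need to look at the details of Award_SK 6, 2209 and 2270 in the dimension in order to work out, from a business point of view, which is the right Award_SK to keep; the other two can be discarded. Next, rows 3 and 4 both show Award_SK 8, so I know I need to look at Award_SK 8, 1853 and 2278 together, and so on. However, as I go down the table, the same combinations recur in different permutations: Award_SK 1853 eventually shows up again in the first column, with Award_SK 8 and Award_SK 2278 in the second. My table has 8000 rows, but if I could stop the combinations from repeating it would be much smaller, and I would end up with a table like the one below. I am not sure what to add to my code to achieve this. Maybe I could even do it in Excel, but again, I am not sure how.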

Award_SK Matched Award_SK
6 2209
6 2270
8 1853
8 2278
17 2052
17 2442
22 1895
22 2282
22 2428

I would really appreciate any help. Thanks.

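Rather than only excluding rows where an SK matches itself, require the matching SK to be higher than the first one, so that each pair is returned in one order only.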

SELECT DISTINCT
  AW.Dim_Award_SK as Award_SK,
  DIM.Dim_Award_SK as Matching_Award_SK 
FROM #DuplicatedAwards AW
JOIN DM.DIM_AWARD DIM
  ON ( DIM.Award_Code = AW.Award_Code OR
       DIM.Award_Desc = AW.Award_Desc 
     ) 
 AND AW.Dim_Award_SK < DIM.Dim_Award_SK
ORDER BY AW.Dim_Award_SK, DIM.Dim_Award_SK
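
One caveat: the output in the question also pairs 2209 with 2270, so a "lower SK first" filter would presumably still return the row 2209, 2270 for that group. If the goal is exactly the reduced table shown in the question, one possible variation is to keep only rows whose first SK has no lower match of its own. The sketch below is only an illustration: `@Pairs` is a hypothetical table variable standing in for the symmetric pair list that the question's query produces, loaded with the rows for the first group only. It also assumes every member of a group is directly matched to the group's lowest SK; a chain where A matches B and B matches C, but A does not match C, would need something more, such as a recursive CTE.

```sql
-- Hypothetical stand-in for the symmetric pair list from the question's query
DECLARE @Pairs TABLE (Award_SK int, Matching_Award_SK int);

INSERT INTO @Pairs (Award_SK, Matching_Award_SK)
VALUES (6, 2209), (6, 2270),
       (2209, 6), (2209, 2270),
       (2270, 6), (2270, 2209);

-- Keep a row only when its Award_SK is not itself matched to a lower SK,
-- i.e. when Award_SK is the lowest SK among its direct matches
SELECT p.Award_SK, p.Matching_Award_SK
FROM @Pairs p
WHERE NOT EXISTS
      ( SELECT 1
        FROM @Pairs q
        WHERE q.Award_SK = p.Award_SK
          AND q.Matching_Award_SK < p.Award_SK
      )
ORDER BY p.Award_SK, p.Matching_Award_SK;
```

For these six rows the query returns only 6, 2209 and 6, 2270, because 2209 and 2270 each have the lower match 6. The same `NOT EXISTS` condition could be applied to the real pair list by first selecting the pairs into a temporary table.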