Delete repeated occurrences of a value across two columns
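I have a dimension that stores salary awards by code and description. Award_Code and Award_Desc together form a natural key. Each code should have only one description, and each description only one code, but over the years people have added the same award code with a different description, or the same description with a different award code, resulting in a table like the one below. In this example, one award code appears twice with different descriptions (Award_SK 6 and Award_SK 2270), and one award description appears twice with different codes (Award_SK 6 and Award_SK 2209). Only Award_SK 6 is the correct Award_Code/Award_Desc combination, and I need to remove the others from the dimension.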
Award_SK | Award_Code | Award_Desc |
---|---|---|
6 | AWDTEA | Teachers Award |
2209 | TEAAWD | Teachers Award |
2270 | AWDTEA | Award for Teachers |
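To find out which descriptions and codes in the table above are associated with each other, I ran the following code to get the rows that join more than once on Award_Code or Award_Desc.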
```sql
--get the list of awards that are associated either by code or description, and put them in a temporary table
SELECT * INTO #DuplicatedAwards
FROM
(
    SELECT Dim_Award_SK, AWARD_CODE, AWARD_DESC
    FROM
    (
        --Type 1: different Award codes, same award description
        SELECT Dim_Award_SK, award_code, AWD.Award_Desc
        FROM DM.DIM_AWARD AWD
        INNER JOIN
        (
            SELECT Award_Desc, COUNT(Dim_Award_SK) AS total_of_Same_Description_different_code
            FROM DM.DIM_AWARD
            GROUP BY Award_Desc, Award_Class_Desc
            HAVING COUNT(Award_Desc) > 1
        ) A ON AWD.Award_Desc = A.Award_Desc
        UNION ALL
        --Type 2: different award description, same award code
        SELECT Dim_Award_SK, A.Award_Code, AWD.Award_Desc
        FROM DM.DIM_AWARD AWD
        INNER JOIN
        (
            SELECT Award_Code, COUNT(Dim_Award_SK) AS Total_of_Same_Code_Different_Description
            FROM DM.DIM_AWARD
            GROUP BY Award_Code
            HAVING COUNT(DISTINCT Award_Desc) > 1
        ) A ON AWD.Award_Code = A.Award_Code
    ) B
) C

--Join the temporary table to the dimension on award code OR award description. This will show an Award_SK in the first column
--and its matched Award_SK's in the second column
--When a new SK starts in the first column we are looking at a new group of matched awards
SELECT DISTINCT
    AW.Dim_Award_SK AS Award_SK,
    DIM.Dim_Award_SK AS Matching_Award_SK
FROM #DuplicatedAwards AW
INNER JOIN DM.DIM_AWARD DIM
    ON DIM.Award_Code = AW.Award_Code OR DIM.Award_Desc = AW.Award_Desc
--exclude rows where the affected SK is matched with itself
WHERE DIM.Dim_Award_SK <> AW.Dim_Award_SK
ORDER BY AW.Dim_Award_SK, DIM.Dim_Award_SK

DROP TABLE #DuplicatedAwards
```
This gives me results like this:
Award_SK | Matched Award_SK |
---|---|
6 | 2209 |
6 | 2270 |
8 | 1853 |
8 | 2278 |
17 | 2052 |
17 | 2442 |
22 | 1895 |
22 | 2282 |
22 | 2428 |
1853 | 8 |
1853 | 2278 |
1895 | 22 |
1895 | 2282 |
1895 | 2428 |
2052 | 17 |
2052 | 2442 |
2209 | 6 |
2209 | 2270 |
2270 | 6 |
2270 | 2209 |
2278 | 8 |
2278 | 1853 |
2282 | 22 |
2282 | 1895 |
2282 | 2428 |
2428 | 22 |
2428 | 1895 |
2428 | 2282 |
2442 | 17 |
2442 | 2052 |
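The first two values in the left column are the same, so I know I need to look at the details of Award_SK 6, 2209 and 2270 in the dimension to work out with the business which Award_SK is the right one to keep; the other two can then be discarded. Next, rows 3 and 4 both show Award_SK 8, so I know I need to look at Award_SK 8, 1853 and 2278 together, and so on. However, as I go through the table, these combinations come up again in different permutations: Award_SK 1853 eventually appears in the first column again, with Award_SK 8 and Award_SK 2278 in the second column. My table has 8000 rows, but it would be significantly smaller if I could stop the combinations from repeating, and I would end up with a table like the one below. I am not sure what to add to my code to achieve this. Maybe I could even do it in Excel, but again, I am not sure how.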
Award_SK | Matched Award_SK |
---|---|
6 | 2209 |
6 | 2270 |
8 | 1853 |
8 | 2278 |
17 | 2052 |
17 | 2442 |
22 | 1895 |
22 | 2282 |
22 | 2428 |
I would really appreciate any help. Thanks.
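Instead of only excluding rows where an SK matches itself, require the matching SK to be higher. Using `<` in place of `<>` returns each pair of SKs once rather than in both orders.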
```sql
SELECT DISTINCT
    AW.Dim_Award_SK AS Award_SK,
    DIM.Dim_Award_SK AS Matching_Award_SK
FROM #DuplicatedAwards AW
JOIN DM.DIM_AWARD DIM
    ON ( DIM.Award_Code = AW.Award_Code OR
         DIM.Award_Desc = AW.Award_Desc
       )
   AND AW.Dim_Award_SK < DIM.Dim_Award_SK
ORDER BY AW.Dim_Award_SK, DIM.Dim_Award_SK
```
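
A join on lower-SK-to-higher-SK still returns pairs between two members of a group when neither is the lowest SK; for the result shown in the question, the row 2209 | 2270 would remain. Here is a minimal sketch of one way to trim those rows, so that only the lowest SK of each group appears in the first column. It assumes every member of a group matches every other member directly, as in the result shown. The names `Pairs`, `p` and `q` are illustrative and not part of the original code, and the query has to run before `#DuplicatedAwards` is dropped.

```sql
-- Sketch only. Pairs, p and q are illustrative names, not from the original code.
-- Assumes #DuplicatedAwards still exists and the previous statement ended with a semicolon.
WITH Pairs AS
(
    -- Each matched pair once, with the lower SK on the left
    SELECT DISTINCT
        AW.Dim_Award_SK  AS Award_SK,
        DIM.Dim_Award_SK AS Matching_Award_SK
    FROM #DuplicatedAwards AW
    JOIN DM.DIM_AWARD DIM
        ON (DIM.Award_Code = AW.Award_Code OR DIM.Award_Desc = AW.Award_Desc)
       AND AW.Dim_Award_SK < DIM.Dim_Award_SK
)
SELECT p.Award_SK, p.Matching_Award_SK
FROM Pairs p
-- Drop rows whose left-hand SK is itself the higher match of some other SK
WHERE NOT EXISTS
(
    SELECT 1
    FROM Pairs q
    WHERE q.Matching_Award_SK = p.Award_SK
)
ORDER BY p.Award_SK, p.Matching_Award_SK;
```

If some awards are linked only indirectly (A matches B and B matches C, but A does not match C), this filter can drop a member from the output, so it is worth comparing the distinct SKs before and after applying it.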