Create an associated matching ID based on multiple columns in the same table
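I have an interesting problem: I need to create a unique identifier for a set of data based on matching groups. The grouping depends on multiple criteria, but in general I need to take this input: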
SOURCE_ID | MATCH_ID | PHONE |
---|---|---|
1 | 1 | (999)9999999 |
1 | 2 | (999)9999999 |
2 | 1 | (999)9999999 |
213710 | 707187 | (001)2548987 |
213710 | 759263 | (100)8348243 |
213705 | 2416730 | (156)6676200 |
213705 | 12116102 | (132)3453523 |
and produce this output:
SOURCE_ID | MATCH_ID | PHONE | GENERATED_ID |
---|---|---|---|
1 | 1 | (999)9999999 | 1 |
1 | 2 | (999)9999999 | 1 |
2 | 1 | (999)9999999 | 1 |
213710 | 707187 | (001)2548987 | 2 |
213710 | 759263 | (100)8348243 | 2 |
213705 | 2416730 | (156)6676200 | 3 |
213705 | 12116102 | (132)3453523 | 3 |
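I used the DENSE_RANK() function to create two separate IDs, one ordered by PHONE and the other by the SOURCE_ID column. The PHONE ordering gives me the correct output for rows 1-3 but incorrect output for rows 4-7, while the SOURCE_ID ordering works for rows 4-7 but not for rows 1-3.
How can I combine these in a way that produces the desired output above? I have tried combining the columns in every format I could think of, without success.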
Output from testing, highlighted correct results. Each TEST## column is noted below
SQL for reference:
SELECT SOURCE_ID,
MATCH_ID,
PHONE,
DENSE_RANK() OVER(ORDER BY PHONE) AS PHONE_SORT,
DENSE_RANK() OVER(ORDER BY SOURCE_ID) AS SOURCE_ID_SORT,
DENSE_RANK() OVER(ORDER BY MATCH_ID, INTERNAL_ROW_ID) AS TEST1,
DENSE_RANK() OVER(ORDER BY SOURCE_ID, MATCH_ID) AS TEST2,
DENSE_RANK() OVER(ORDER BY MATCH_ID) AS TEST3,
DENSE_RANK() OVER(ORDER BY MATCH_ID, SOURCE_ID, PHONE) AS TEST4,
DENSE_RANK() OVER(ORDER BY MATCH_ID, PHONE, SOURCE_ID) AS TEST5,
DENSE_RANK() OVER(ORDER BY SOURCE_ID, MATCH_ID, PHONE) AS TEST6,
DENSE_RANK() OVER(ORDER BY SOURCE_ID, PHONE, MATCH_ID) AS TEST7,
DENSE_RANK() OVER(ORDER BY PHONE, SOURCE_ID, MATCH_ID) AS TEST8,
DENSE_RANK() OVER(ORDER BY PHONE, MATCH_ID, SOURCE_ID) AS TEST9,
DENSE_RANK() OVER(ORDER BY PHONE, SOURCE_ID) AS TEST10,
DENSE_RANK() OVER(ORDER BY PHONE, MATCH_ID) AS TEST11,
DENSE_RANK() OVER(ORDER BY SOURCE_ID, PHONE) AS TEST12,
DENSE_RANK() OVER(ORDER BY MATCH_ID, PHONE) AS TEST13
FROM MY_TABLE;
TIA!
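Update: this more or less worked as expected, but once additional records were introduced (this is a very small subset of the full dataset), I started running into additional scenarios. I am now trying to get the larger dataset to look closer to this: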
Correct result set
I got almost all the way there with the following code, but I am struggling to correctly re-associate the last few records:
SELECT SOURCE_ID,
MATCH_ID,
PHONE,
DENSE_RANK() OVER(ORDER BY RANKABLE_MATCH_ID) AS GENERATED_ID
FROM (
SELECT SOURCE_ID,
MATCH_ID,
PHONE,
COUNT(MATCH_ID) OVER (PARTITION BY MATCH_ID) C_MATCH_ID,
IFF(C_PHONE >= C_SOURCE_BY_MATCH AND C_MATCH_ID = C_PHONE, SOURCE_ID::TEXT, RANKABLE_INTERNAL_PHONE) AS RANKABLE_MATCH_ID
FROM (
SELECT SOURCE_ID,
MATCH_ID,
PHONE,
COUNT(SOURCE_ID) OVER (PARTITION BY SOURCE_ID) C_SOURCE_ID,
COUNT(PHONE) OVER (PARTITION BY PHONE) C_PHONE,
COUNT(DISTINCT SOURCE_ID) OVER(PARTITION BY MATCH_ID) C_SOURCE_BY_MATCH,
IFF(C_SOURCE_ID > C_PHONE, SOURCE_ID::TEXT, PHONE) AS RANKABLE_INTERNAL_PHONE
FROM MY_TABLE
)
)
My output based on above code
This is some rather obtuse logic:
SELECT
column1
,column2
,column3
,dense_rank() over (order by rankable)
FROM (
SELECT *
,count(column1) over (partition by column1) c_c1
,count(column3) over (partition by column3) c_c3
,iff(c_c1> c_c3, column1::text, column3) as rankable
FROM VALUES
(1,1,'(999)9999999'),
(1, 2,'(999)9999999'),
(2, 1,'(999)9999999'),
(213710, 707187,'(001)2548987'),
(213710, 759263,'(100)8348243'),
(213705, 2416730,'(156)6676200'),
(213705, 12116102,'(132)3453523')
)
gives:
COLUMN1 | COLUMN2 | COLUMN3 | DENSE_RANK() OVER (ORDER BY RANKABLE) |
---|---|---|---|
1 | 1 | (999)9999999 | 1 |
1 | 2 | (999)9999999 | 1 |
2 | 1 | (999)9999999 | 1 |
213,705 | 2,416,730 | (156)6676200 | 2 |
213,705 | 12,116,102 | (132)3453523 | 2 |
213,710 | 707,187 | (001)2548987 | 3 |
213,710 | 759,263 | (100)8348243 | 3 |
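To make the counting logic in that query easier to follow, here is a minimal Python sketch of what I understand it to do: each row is ranked by its SOURCE_ID when that SOURCE_ID occurs more often than the row's PHONE, and by its PHONE otherwise. The function name `generated_ids` and the tuple layout are my own, and the dense rank is emulated with a plain string sort, which I am assuming orders these values the same way the database does.

```python
from collections import Counter

# Same seven rows as the VALUES clause above: (source_id, match_id, phone).
rows = [
    (1, 1, "(999)9999999"),
    (1, 2, "(999)9999999"),
    (2, 1, "(999)9999999"),
    (213710, 707187, "(001)2548987"),
    (213710, 759263, "(100)8348243"),
    (213705, 2416730, "(156)6676200"),
    (213705, 12116102, "(132)3453523"),
]


def generated_ids(rows):
    """Emulate the count-based 'rankable' key followed by a dense rank."""
    source_counts = Counter(source_id for source_id, _, _ in rows)
    phone_counts = Counter(phone for _, _, phone in rows)

    # Use the SOURCE_ID as the grouping key only when it is the more
    # frequent of the two values on this row; otherwise use the PHONE.
    rankable = [
        str(source_id) if source_counts[source_id] > phone_counts[phone] else phone
        for source_id, _, phone in rows
    ]

    # Dense rank: distinct keys in sorted order get 1, 2, 3, ...
    rank_of = {key: rank for rank, key in enumerate(sorted(set(rankable)), start=1)}
    return [rank_of[key] for key in rankable]


print(generated_ids(rows))
```

In input order this prints `[1, 1, 1, 3, 3, 2, 2]`, the same grouping as the table above. The group numbers differ from the ones requested in the question only because 213705 sorts before 213710.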
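A more complex answer:
So your extended question shows that you are actually clustering on sets: for any SOURCE_ID, all of its PHONE values are part of the same set, and therefore every SOURCE_ID that shares that set of PHONE values is also in the group. This really should be solved with a recursive CTE, to allow for relationships of more than two steps. Here is a solution that handles two layers of linking: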
WITH data AS (
SELECT * FROM VALUES
(2, '(999)9999999'),
(1, '(999)9999999'),
(1, '(999)9999999'),
(2, '(999)9999999'),
(213705, 'AAA'),
(213705, 'AAB'),
(213705, 'AAC'),
(9624765, 'AAA'),
(9624765, 'AAB'),
(9624765, 'AAC'),
(2175594867, 'AAA'),
(2175594867, 'AAB'),
(213710, 'BAA'),
(213710, 'BAB'),
(9213710, 'BAA'),
(9213710, 'BAB'),
(89213710, 'BAA'),
(89213710, 'BAB')
), col1 as (
select column1
,array_agg(DISTINCT column2) as col2_array
from data
group by 1
), col2 as (
select
*,
row_number() over (order by true) as rn
FROM (
select col2_array
,array_agg(DISTINCT column1) as col1_array
from col1
group by 1
)
)
SELECT d.column1, d.column2, r.rn
FROM data as d
JOIN col2 as r
on array_contains(d.column1::variant, r.col1_array)
and array_contains(d.column2::variant, r.col2_array)
ORDER BY 3;
COLUMN1 | COLUMN2 | RN |
---|---|---|
89,213,710 | BAB | 1 |
89,213,710 | BAA | 1 |
9,213,710 | BAB | 1 |
9,213,710 | BAA | 1 |
213,710 | BAB | 1 |
213,710 | BAA | 1 |
1 | (999)9999999 | 2 |
2 | (999)9999999 | 2 |
1 | (999)9999999 | 2 |
2 | (999)9999999 | 2 |
2,175,594,867 | AAA | 3 |
2,175,594,867 | AAB | 3 |
9,624,765 | AAA | 4 |
9,624,765 | AAB | 4 |
9,624,765 | AAC | 4 |
213,705 | AAC | 4 |
213,705 | AAB | 4 |
213,705 | AAA | 4 |
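For links that go deeper than two steps, the grouping amounts to finding connected components in a graph whose nodes are SOURCE_ID and PHONE values. As an illustration only (this is a Python sketch, not the recursive CTE itself, and `group_ids` is a name I made up), a small union-find does the job, under the assumption that any shared PHONE should link two SOURCE_IDs:

```python
def group_ids(pairs):
    """Assign one group id per connected component of (source_id, phone) pairs."""
    parent = {}

    def find(node):
        # Follow parent links to the component root, halving paths as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag the values so a SOURCE_ID can never collide with a PHONE.
    for source_id, phone in pairs:
        union(("source", source_id), ("phone", phone))

    # Number the components in order of first appearance.
    ids = {}
    result = []
    for source_id, _ in pairs:
        root = find(("source", source_id))
        ids.setdefault(root, len(ids) + 1)
        result.append(ids[root])
    return result


# A chain: 1 and 2 share phone A, 2 and 3 share phone B, 4 stands alone.
chain = [(1, "A"), (2, "A"), (2, "B"), (3, "B"), (4, "C")]
print(group_ids(chain))
```

This prints `[1, 1, 1, 1, 2]`: SOURCE_ID 3 ends up with SOURCE_ID 1 even though they share no PHONE directly. Note that on the sample data above this approach would also place 2175594867 in the same group as 213705 and 9624765, because they share AAA and AAB, whereas the two-layer query gives it its own RN because its PHONE set is not identical. Which of the two behaviours is right depends on the matching rule you actually want.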