Tag islands of records
I have this dataset:
Id | PrevId | NextId | Product | Process | Date |
---|---|---|---|---|---|
1 | NULL | 4 | Product 1 | Process A | 2021-04-24 |
2 | NULL | 3 | Product 2 | Process A | 2021-04-24 |
3 | 2 | 5 | Product 2 | Process A | 2021-04-24 |
4 | 1 | 7 | Product 1 | Process B | 2021-04-26 |
5 | 3 | 6 | Product 2 | Process B | 2021-04-24 |
6 | 5 | NULL | Product 2 | Process B | 2021-04-24 |
7 | 4 | 9 | Product 1 | Process B | 2021-04-29 |
9 | 7 | 10 | Product 1 | Process A | 2021-05-01 |
10 | 9 | 15 | Product 1 | Process A | 2021-05-03 |
15 | 10 | 19 | Product 1 | Process A | 2021-05-04 |
19 | 15 | NULL | Product 1 | Process C | 2021-05-05 |
For each product, I need to tag the consecutive records (islands) that share the same Process, for example:
Id | PrevId | NextId | Product | Process | Date | Tag |
---|---|---|---|---|---|---|
1 | NULL | 4 | Product 1 | Process A | 2021-04-24 | 1 |
4 | 1 | 7 | Product 1 | Process B | 2021-04-26 | 2 |
7 | 4 | 9 | Product 1 | Process B | 2021-04-29 | 2 |
9 | 7 | 10 | Product 1 | Process A | 2021-05-01 | 3 |
10 | 9 | 15 | Product 1 | Process A | 2021-05-03 | 3 |
15 | 10 | 19 | Product 1 | Process A | 2021-05-04 | 3 |
19 | 15 | NULL | Product 1 | Process C | 2021-05-05 | 4 |
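A product goes through multiple processes and can go through the same process more than once.
I basically need to generate the Tag column. The logic is that consecutive records with the same Process should be grouped together. Note that the same process may show up again later in the sequence, and in that case it should be treated as a new group.
I have tried the basic window functions (ROW_NUMBER and DENSE_RANK), but the problem is that they count within a partition rather than across partitions.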
You can use lag() to find where the process is the same as on the previous row. Then take a cumulative sum of the changes:
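-- prev_process is NULL on each product's first row, so that row counts as a change
-- and the running sum already starts at 1 (no extra "1 +" is needed).
select t.*,
       sum(case when process = prev_process then 0 else 1 end) over (partition by product order by id) as tag
from (select t.*,
             lag(process) over (partition by product order by id) as prev_process
      from t
     ) t;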
Here is a db<>fiddle.
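To try the lag() and running-sum idea without a database server, here is a minimal sketch that runs the same kind of query against the question's sample rows in an in-memory SQLite database from Python. The table name `t`, the lower-case column names, and the helper `tag_islands` are assumptions made for this illustration, and the running sum is used without an added offset. SQLite needs to be version 3.25 or later for window functions.

```python
import sqlite3

# Sample rows from the question: (Id, PrevId, NextId, Product, Process, Date).
ROWS = [
    (1, None, 4, "Product 1", "Process A", "2021-04-24"),
    (2, None, 3, "Product 2", "Process A", "2021-04-24"),
    (3, 2, 5, "Product 2", "Process A", "2021-04-24"),
    (4, 1, 7, "Product 1", "Process B", "2021-04-26"),
    (5, 3, 6, "Product 2", "Process B", "2021-04-24"),
    (6, 5, None, "Product 2", "Process B", "2021-04-24"),
    (7, 4, 9, "Product 1", "Process B", "2021-04-29"),
    (9, 7, 10, "Product 1", "Process A", "2021-05-01"),
    (10, 9, 15, "Product 1", "Process A", "2021-05-03"),
    (15, 10, 19, "Product 1", "Process A", "2021-05-04"),
    (19, 15, None, "Product 1", "Process C", "2021-05-05"),
]

# lag() fetches the previous row's process within the same product;
# the running sum then counts how many times the process has changed so far.
TAG_QUERY = """
select id, product, process,
       sum(case when process = prev_process then 0 else 1 end)
           over (partition by product order by id) as tag
from (select t.*,
             lag(process) over (partition by product order by id) as prev_process
      from t
     ) t
order by product, id
"""


def tag_islands(rows):
    """Load rows into a throwaway table and return (id, product, process, tag) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (id integer, previd integer, nextid integer, "
        "product text, process text, date text)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(TAG_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in tag_islands(ROWS):
        print(row)
```

The sketch orders by id, as the answer does, which assumes that ids increase along each product's PrevId/NextId chain. That holds for the sample data but is not guaranteed in general.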
If you do not have to validate PrevId and NextId (that is, if your data is already correctly ordered), you can try the following:
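WITH cte AS (
    SELECT *
         -- x: position of the row within its product, ordered by date
         , ROW_NUMBER() OVER (PARTITION BY Product ORDER BY [Date]) AS x
         -- y: position of the row within its product and process
         , DENSE_RANK() OVER (PARTITION BY Product, Process ORDER BY [Date]) AS y
    FROM T1
    WHERE Product = 'Product 1'
),
cteTag AS (
    -- x - y stays constant while the same process repeats on consecutive rows
    SELECT Id, PrevId, NextId, Product, Process, [Date], x - y AS Tag_
    FROM cte
)
-- Renumber the raw differences as 1, 2, 3, ...
SELECT Id, PrevId, NextId, Product, Process, [Date],
       DENSE_RANK() OVER (PARTITION BY Product ORDER BY Tag_) AS Tag
FROM cteTag
ORDER BY [Date];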
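One thing to watch with the row-number difference: on its own it is not guaranteed to be unique per island. For a sequence such as A, B, A the differences come out as 0, 1, 1, so the B row and the second A row would share a tag. The sample data happens not to hit this. A possible variation, sketched below under assumptions, keys each island on (product, process, difference) and then numbers the islands by their first id. It is run through SQLite from Python so it can be tried locally. The table `t`, the helper `tag_islands_keyed`, and ordering by id instead of Date (the Product 2 rows all share one date) are choices made for this illustration, not part of the answer above.

```python
import sqlite3

# grp is the classic "difference of row numbers"; it is only meaningful
# together with product and process, so all three form the island key.
KEYED_QUERY = """
with numbered as (
    select id, product, process,
           row_number() over (partition by product order by id)
         - row_number() over (partition by product, process order by id) as grp
    from t
),
islands as (
    select id, product, process,
           min(id) over (partition by product, process, grp) as island_start
    from numbered
)
select id,
       dense_rank() over (partition by product order by island_start) as tag
from islands
order by product, id
"""


def tag_islands_keyed(rows):
    """rows: iterable of (id, product, process); returns a list of (id, tag)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (id integer, product text, process text)")
    conn.executemany("insert into t values (?, ?, ?)", rows)
    result = conn.execute(KEYED_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    # Product 1 rows from the question, reduced to (id, product, process).
    product_1 = [
        (1, "Product 1", "Process A"),
        (4, "Product 1", "Process B"),
        (7, "Product 1", "Process B"),
        (9, "Product 1", "Process A"),
        (10, "Product 1", "Process A"),
        (15, "Product 1", "Process A"),
        (19, "Product 1", "Process C"),
    ]
    for row in tag_islands_keyed(product_1):
        print(row)
```

Numbering by the island's first id keeps the tags in sequence order, which the raw difference does not promise once islands of different lengths are mixed.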