Create ("force") an island in a "gaps and islands" problem
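I have code that partitions my data using a gaps-and-islands approach. The data reports user activity, working time, and idle time based on recorded timestamps and activities. My code works well, but every once in a while a user_id records a series of activities for one application, goes idle, and then returns to the same application to record additional activity. With my current code it looks like the user spent nearly two hours on one application, when in fact there was a long period of downtime in between. I would like to "force" the creation of an island, restarting the partition whenever the gap between activities exceeds 30 minutes.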
ACTIVITY_DATE | USER_ID | APPL_ID | PR1 | PR2
---------------------------------------------------
11/20/2020 10:55 A 9340 1 1
11/20/2020 10:55 A 9340 2 2
11/20/2020 10:58 A 9340 3 3
11/20/2020 10:58 A 9340 4 4
11/20/2020 10:59 A 9340 5 5
11/20/2020 13:09 A 9340 6 6
11/20/2020 13:09 A 9340 7 7
11/20/2020 13:10 A 9340 8 8
11/20/2020 13:10 A 9340 9 9
11/20/2020 17:12 A 8354 10 1
11/20/2020 17:14 A 8354 11 2
11/20/2020 17:14 A 8354 12 3
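The final result needs to restart the partition for column PR2 at the sixth row in this example, because the gap between recorded activities for the same appl_id is more than 30 minutes: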
ACTIVITY_DATE | USER_ID | APPL_ID | PR1 | PR2
---------------------------------------------------
11/20/2020 10:55 A 9340 1 1
11/20/2020 10:55 A 9340 2 2
11/20/2020 10:58 A 9340 3 3
11/20/2020 10:58 A 9340 4 4
11/20/2020 10:59 A 9340 5 5
11/20/2020 13:09 A 9340 6 1
11/20/2020 13:09 A 9340 7 2
11/20/2020 13:10 A 9340 8 3
11/20/2020 13:10 A 9340 9 4
11/20/2020 17:12 A 8354 10 1
11/20/2020 17:14 A 8354 11 2
11/20/2020 17:14 A 8354 12 3
Here is my current code:
select activity_date, user_id, appl_id,
       row_number() over(partition by user_id order by activity_date) rn1,
       row_number() over(partition by user_id, appl_id order by activity_date) rn2
from
  (select
     activity_date, user_id, appl_id, count(*)
   from mytable tt
   where
     user_id in ('A', 'B', 'C')
     and activity_date >= trunc(sysdate - 4,'DD')
     and activity_date <= trunc(sysdate - 3,'DD')
   group by
     activity_date, user_id, appl_id) tt
You can use MATCH_RECOGNIZE:
SELECT activity_date,
       user_id,
       appl_id,
       pr1,
       -- Number the rows within each match (island) found below.
       ROW_NUMBER() OVER ( PARTITION BY user_id, appl_id, mno ORDER BY pr1 )
         AS pr2
FROM   (
  SELECT t.*,
         ROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY activity_date ) AS pr1
  FROM   table_name t
)
MATCH_RECOGNIZE(
  PARTITION BY user_id, appl_id
  ORDER BY pr1
  MEASURES
    MATCH_NUMBER() AS mno
  ALL ROWS PER MATCH
  -- An island is any run of rows that are each followed within 30 minutes,
  -- closed by one final row (last_activity is undefined, so it always matches).
  PATTERN ( activities* last_activity )
  DEFINE activities AS
    NEXT(activity_date) <= LAST(activity_date) + INTERVAL '30' MINUTE
)
ORDER BY user_id, pr1;
Which, for the sample data:
CREATE TABLE table_name ( ACTIVITY_DATE, USER_ID, APPL_ID ) AS
SELECT DATE '2020-11-20' + INTERVAL '10:55' HOUR TO MINUTE, 'A', 9340 FROM DUAL UNION ALL
SELECT DATE '2020-11-20' + INTERVAL '10:55' HOUR TO MINUTE, 'A', 9340 FROM DUAL UNION ALL
SELECT DATE '2020-11-20' + INTERVAL '10:58' HOUR TO MINUTE, 'A', 9340 FROM DUAL UNION ALL
SELECT DATE '2020-11-20' + INTERVAL '10:58' HOUR TO MINUTE, 'A', 9340 FROM DUAL UNION ALL
SELECT DATE '2020-11-20' + INTERVAL '10:59' HOUR TO MINUTE, 'A', 9340 FROM DUAL UNION ALL
SELECT DATE '2020-11-20' + INTERVAL '13:09' HOUR TO MINUTE, 'A', 9340 FROM DUAL UNION ALL
SELECT DATE '2020-11-20' + INTERVAL '13:09' HOUR TO MINUTE, 'A', 9340 FROM DUAL UNION ALL
SELECT DATE '2020-11-20' + INTERVAL '13:10' HOUR TO MINUTE, 'A', 9340 FROM DUAL UNION ALL
SELECT DATE '2020-11-20' + INTERVAL '13:10' HOUR TO MINUTE, 'A', 9340 FROM DUAL UNION ALL
SELECT DATE '2020-11-20' + INTERVAL '17:12' HOUR TO MINUTE, 'A', 8354 FROM DUAL UNION ALL
SELECT DATE '2020-11-20' + INTERVAL '17:14' HOUR TO MINUTE, 'A', 8354 FROM DUAL UNION ALL
SELECT DATE '2020-11-20' + INTERVAL '17:14' HOUR TO MINUTE, 'A', 8354 FROM DUAL;
Outputs:
ACTIVITY_DATE | USER_ID | APPL_ID | PR1 | PR2
:------------------ | :------ | ------: | --: | --:
2020-11-20 10:55:00 | A | 9340 | 1 | 1
2020-11-20 10:55:00 | A | 9340 | 2 | 2
2020-11-20 10:58:00 | A | 9340 | 3 | 3
2020-11-20 10:58:00 | A | 9340 | 4 | 4
2020-11-20 10:59:00 | A | 9340 | 5 | 5
2020-11-20 13:09:00 | A | 9340 | 6 | 1
2020-11-20 13:09:00 | A | 9340 | 7 | 2
2020-11-20 13:10:00 | A | 9340 | 8 | 3
2020-11-20 13:10:00 | A | 9340 | 9 | 4
2020-11-20 17:12:00 | A | 8354 | 10 | 1
2020-11-20 17:14:00 | A | 8354 | 11 | 2
2020-11-20 17:14:00 | A | 8354 | 12 | 3
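To make the rule explicit, here is a minimal Python sketch of the same numbering, done outside the database. It is not part of the query above and is only meant as a cross-check. The function name `number_islands`, the `max_gap` parameter, and the tuple layout are assumptions made for illustration. It assumes that ties on activity_date keep their input order, whereas ROW_NUMBER() leaves the order of ties unspecified.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def number_islands(rows, max_gap=timedelta(minutes=30)):
    """Return (activity_date, user_id, appl_id, pr1, pr2) tuples.

    rows is an iterable of (activity_date, user_id, appl_id).
    pr1 numbers a user's rows by activity_date; pr2 numbers rows within an
    island, restarting when the gap to the previous row for the same
    user_id and appl_id exceeds max_gap.
    """
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[1]].append(row)

    result = []
    for user_id in sorted(by_user):
        # sorted() is stable, so rows with equal timestamps keep input order.
        ordered = sorted(by_user[user_id], key=lambda r: r[0])
        last_seen = {}  # appl_id -> (previous activity_date, previous pr2)
        for pr1, (ts, _, appl_id) in enumerate(ordered, start=1):
            prev = last_seen.get(appl_id)
            if prev is not None and ts - prev[0] <= max_gap:
                pr2 = prev[1] + 1  # still inside the current island
            else:
                pr2 = 1            # first row for this app, or gap too long
            last_seen[appl_id] = (ts, pr2)
            result.append((ts, user_id, appl_id, pr1, pr2))
    return result


# The twelve sample rows from the question, as (hour, minute, appl_id).
day = datetime(2020, 11, 20)
sample = [
    (day + timedelta(hours=h, minutes=m), "A", appl)
    for h, m, appl in [
        (10, 55, 9340), (10, 55, 9340), (10, 58, 9340), (10, 58, 9340),
        (10, 59, 9340), (13, 9, 9340), (13, 9, 9340), (13, 10, 9340),
        (13, 10, 9340), (17, 12, 8354), (17, 14, 8354), (17, 14, 8354),
    ]
]

for ts, user_id, appl_id, pr1, pr2 in number_islands(sample):
    print(f"{ts:%Y-%m-%d %H:%M} {user_id} {appl_id} {pr1:>3} {pr2:>3}")
```

The comparison uses `<=`, matching the `DEFINE` condition in the query, so a gap of exactly 30 minutes stays in the same island and only a longer gap starts a new one.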