SQL - Convert Time Series Events into On/Off Pairs (handling potential missing On's or Off's)
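In SQL Server, I have a set of time series on/off events that looks like this (for simplicity I am only showing one alarm number, but the same table holds others):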
'Alarms' Table:
AlarmNumber Time AlarmState
1592 2020-01-02 01:52:02 1
1592 2020-01-02 01:58:07 0
1592 2020-04-28 03:46:49 1
1592 2020-04-28 06:19:10 0
1592 2020-06-04 00:25:22 1
1592 2020-08-27 01:57:03 1
1592 2020-08-27 05:16:32 0
1592 2020-09-17 02:51:57 0
I am trying to convert this into On/Off pairs:
Output I am trying to achieve, ideally as an SQL View:
AlarmNumber StartTime EndTime
1592 2020-01-02 01:52:02 2020-01-02 01:58:07
1592 2020-04-28 03:46:49 2020-04-28 06:19:10
1592 2020-06-04 00:25:22 NULL
1592 2020-08-27 01:57:03 2020-08-27 05:16:32
1592 NULL 2020-09-17 02:51:57
If I had a clean dataset with no missing 'On' or 'Off' events, I could achieve this as follows:
select tOn.AlarmNumber, tOn.Time StartTime, tOff.Time EndTime
from (
select AlarmNumber, Time,
ROW_NUMBER() Over(Partition by AlarmNumber order by Time) EventID
from Alarms where AlarmState = 1
) tOn
LEFT JOIN (
select AlarmNumber, Time,
ROW_NUMBER() Over(Partition by AlarmNumber order by Time) EventID
from Alarms where AlarmState = 0
) tOff
on (tOn.AlarmNumber = tOff.AlarmNumber and tOn.EventID = tOff.EventID)
(Code adapted from Adriano Carneiro's answer to "T-SQL Start and end date times from a single column")
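To make the failure mode concrete, here is a minimal sketch that runs the clean-data query above against the sample rows. It uses Python's built-in sqlite3 module instead of SQL Server (an assumption, chosen only so the example is self-contained; SQLite 3.25 or later is needed for window functions). The helper name naive_pairs and the added ORDER BY are illustrative, not part of the original query.

```python
import sqlite3

# Sample rows from the question: (AlarmNumber, Time, AlarmState)
ROWS = [
    (1592, "2020-01-02 01:52:02", 1),
    (1592, "2020-01-02 01:58:07", 0),
    (1592, "2020-04-28 03:46:49", 1),
    (1592, "2020-04-28 06:19:10", 0),
    (1592, "2020-06-04 00:25:22", 1),
    (1592, "2020-08-27 01:57:03", 1),
    (1592, "2020-08-27 05:16:32", 0),
    (1592, "2020-09-17 02:51:57", 0),
]

# The clean-data query: number the ONs and the OFFs separately,
# then pair the n-th ON with the n-th OFF.
NAIVE_SQL = """
select ton.alarmnumber, ton.time as starttime, toff.time as endtime
from (
    select alarmnumber, time,
           row_number() over (partition by alarmnumber order by time) as eventid
    from alarms where alarmstate = 1
) ton
left join (
    select alarmnumber, time,
           row_number() over (partition by alarmnumber order by time) as eventid
    from alarms where alarmstate = 0
) toff
on ton.alarmnumber = toff.alarmnumber and ton.eventid = toff.eventid
order by ton.alarmnumber, ton.time  -- added here only for a stable print order
"""


def naive_pairs(rows):
    """Load rows into an in-memory SQLite table and run the clean-data query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table alarms (alarmnumber integer, time text, alarmstate integer)"
    )
    conn.executemany("insert into alarms values (?, ?, ?)", rows)
    result = conn.execute(NAIVE_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for pair in naive_pairs(ROWS):
        print(pair)
```

On the sample data the third ON (2020-06-04 00:25:22) is paired with the third OFF (2020-08-27 05:16:32), and the fourth ON with the stray OFF on 2020-09-17, so a single missing event shifts every later pair for that alarm.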
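My question: can anyone think of an efficient way to process the 'Alarms' table into my example output, handling the missing on/off events (shown as NULL in the example output)?
My fallback is to use a cursor and a WHILE loop, but I am hoping there is a way to do it by grouping the On/Off event pairs together; I just have not been able to get it working. I have 500k+ events, so this is a large dataset to iterate over.
Any ideas are welcome!
Thanks,
Thomas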
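------ Update, 1 November 2020 ------
Two great solutions have been provided. Both work correctly and give the same results on an 80,000-row sample of messy real-world data.
- GMB's solution is easier to read, but runs a bit slower
- gotqn's solution has more lines of code, but ran about 50% faster on my test server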
Once you have the order of the rows, just SELECT them in parts and then combine the results using UNION ALL:
DECLARE @DataSource TABLE
(
[AlarmNumber] INT
,[Time] DATETIME2(0)
,[AlarmState] INT
);
INSERT INTO @DataSource ([AlarmNumber], [Time], [AlarmState])
VALUES (1592, '2020-01-02 01:52:02', 1)
,(1592, '2020-01-02 01:58:07', 0)
,(1592, '2020-04-28 03:46:49', 1)
,(1592, '2020-04-28 06:19:10', 0)
,(1592, '2020-06-04 00:25:22', 1)
,(1592, '2020-08-27 01:57:03', 1)
,(1592, '2020-08-27 05:16:32', 0)
,(1592, '2020-09-17 02:51:57', 0);
-- Add a rowID column to the data
WITH DataSource AS
(
SELECT * ,ROW_NUMBER() Over(Partition by AlarmNumber order by [Time]) rowID
FROM @DataSource
)
-- This is just here so we can sort the result at the end
SELECT * FROM (
-- Select rows of DataSource where there is an ON and subsequent OFF event (DS1 Alarm is ON and DS2 Alarm is OFF)
-- This also catches where there is an ON, but no subsequent OFF (DS2.Time will be NULL)
SELECT DS1.AlarmNumber
,DS1.Time As StartTime
,DS2.Time As EndTime
FROM DataSource DS1
LEFT JOIN DataSource DS2
ON DS1.[rowID] = DS2.[rowID] - 1
AND DS1.AlarmNumber = DS2.AlarmNumber
AND DS2.[AlarmState] = 0
WHERE DS1.[AlarmState] = 1
UNION ALL
-- Select rows of DataSource where there is an OFF and there is no matching ON (aka it turned OFF without ever turning ON)
SELECT DS2.AlarmNumber
,NULL As StartTime
,DS2.Time As EndTime
FROM DataSource DS2
INNER JOIN DataSource DS1
ON DS2.[rowID] -1 = DS1.[rowID]
AND DS1.[AlarmState] = 0
AND DS2.AlarmNumber = DS1.AlarmNumber
WHERE DS2.[AlarmState] = 0
UNION ALL
-- Select rows of DataSource where the first event for this alarm number is an OFF (it would otherwise be missed by the above)
SELECT DS1.AlarmNumber
,NULL As StartTime
,DS1.Time As EndTime
FROM DataSource DS1
WHERE DS1.[AlarmState] = 0 AND DS1.rowID = 1
) z
ORDER BY COALESCE(StartTime,EndTime), AlarmNumber
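For anyone who wants to check the three branches without a SQL Server instance, the following is a minimal sketch that runs the same logic through Python's built-in sqlite3 module (an assumption; SQLite 3.25 or later is required for ROW_NUMBER). Two things differ from the script above and are my own substitutions: an ordinary in-memory table stands in for the table variable, and the row-number column is called rn because rowid has a special meaning in SQLite. The helper name on_off_pairs is hypothetical.

```python
import sqlite3

# Sample rows from the question: (AlarmNumber, Time, AlarmState)
ROWS = [
    (1592, "2020-01-02 01:52:02", 1),
    (1592, "2020-01-02 01:58:07", 0),
    (1592, "2020-04-28 03:46:49", 1),
    (1592, "2020-04-28 06:19:10", 0),
    (1592, "2020-06-04 00:25:22", 1),
    (1592, "2020-08-27 01:57:03", 1),
    (1592, "2020-08-27 05:16:32", 0),
    (1592, "2020-09-17 02:51:57", 0),
]

PAIRS_SQL = """
with datasource as (
    select alarmnumber, time, alarmstate,
           row_number() over (partition by alarmnumber order by time) as rn
    from alarms
)
select * from (
    -- ON rows, paired with the next row only when that row is an OFF
    select ds1.alarmnumber as alarmnumber,
           ds1.time as starttime,
           ds2.time as endtime
    from datasource ds1
    left join datasource ds2
        on ds1.rn = ds2.rn - 1
       and ds1.alarmnumber = ds2.alarmnumber
       and ds2.alarmstate = 0
    where ds1.alarmstate = 1
    union all
    -- OFF rows whose previous row is also an OFF
    select ds2.alarmnumber, null, ds2.time
    from datasource ds2
    inner join datasource ds1
        on ds2.rn - 1 = ds1.rn
       and ds1.alarmstate = 0
       and ds2.alarmnumber = ds1.alarmnumber
    where ds2.alarmstate = 0
    union all
    -- an OFF that is the very first event for its alarm number
    select ds1.alarmnumber, null, ds1.time
    from datasource ds1
    where ds1.alarmstate = 0 and ds1.rn = 1
) z
order by coalesce(starttime, endtime), alarmnumber
"""


def on_off_pairs(rows):
    """Return (AlarmNumber, StartTime, EndTime) tuples for the given events."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table alarms (alarmnumber integer, time text, alarmstate integer)"
    )
    conn.executemany("insert into alarms values (?, ?, ?)", rows)
    result = conn.execute(PAIRS_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for pair in on_off_pairs(ROWS):
        print(pair)
```

The sample data never exercises the third branch, so it is worth testing separately with an alarm whose first event is an OFF, for example `on_off_pairs([(7, "2020-01-01 00:00:00", 0), (7, "2020-01-02 00:00:00", 1)])`.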
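A group consists of two consecutive rows where the first row has state 1 and the second has state 0; any row that does not fit such a pair forms a group of its own. I would solve this with window functions, as follows: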
select
alarmnumber,
max(case when alarmstate = 1 then time end) start_time,
max(case when alarmstate = 0 then time end) end_time
from (
select a.*,
sum(case when alarmstate = 0 and lag_alarmstate = 1 then 0 else 1 end)
over(partition by alarmnumber order by time) grp
from (
select a.*,
lag(alarmstate) over(partition by alarmnumber order by time) lag_alarmstate
from alarms a
) a
) a
group by alarmnumber, grp
This uses lag() to retrieve the "previous" state and a cumulative sum to define the groups. The final step is conditional aggregation.
alarmnumber | start_time | end_time
:---------- | :---------------------- | :----------------------
1592 | 2020-01-02 01:52:02.000 | 2020-01-02 01:58:07.000
1592 | 2020-04-28 03:46:49.000 | 2020-04-28 06:19:10.000
1592 | 2020-06-04 00:25:22.000 | null
1592 | 2020-08-27 01:57:03.000 | 2020-08-27 05:16:32.000
1592 | null | 2020-09-17 02:51:57.000
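To see how the groups form, the two inner levels of the query can be run on their own. The sketch below does that with Python's built-in sqlite3 module rather than SQL Server (an assumption, made only so the example is self-contained; SQLite 3.25 or later is needed for window functions). The helper name group_ids and the final ORDER BY are illustrative additions.

```python
import sqlite3

# Sample rows from the question: (AlarmNumber, Time, AlarmState)
ROWS = [
    (1592, "2020-01-02 01:52:02", 1),
    (1592, "2020-01-02 01:58:07", 0),
    (1592, "2020-04-28 03:46:49", 1),
    (1592, "2020-04-28 06:19:10", 0),
    (1592, "2020-06-04 00:25:22", 1),
    (1592, "2020-08-27 01:57:03", 1),
    (1592, "2020-08-27 05:16:32", 0),
    (1592, "2020-09-17 02:51:57", 0),
]

# Inner two levels of the query: lag() fetches the previous state, and the
# running sum increases by 1 on every row except an OFF that follows an ON.
INNER_SQL = """
select time, alarmstate, lag_alarmstate,
       sum(case when alarmstate = 0 and lag_alarmstate = 1 then 0 else 1 end)
           over (partition by alarmnumber order by time) as grp
from (
    select a.*,
           lag(alarmstate) over (partition by alarmnumber order by time)
               as lag_alarmstate
    from alarms a
) a
order by alarmnumber, time
"""


def group_ids(rows):
    """Return (time, alarmstate, lag_alarmstate, grp) for every event."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table alarms (alarmnumber integer, time text, alarmstate integer)"
    )
    conn.executemany("insert into alarms values (?, ?, ?)", rows)
    result = conn.execute(INNER_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in group_ids(ROWS):
        print(row)
```

On the sample data the grp values are 1, 1, 2, 2, 3, 4, 4, 5. The unmatched ON on 2020-06-04 sits alone in group 3 and the unmatched OFF on 2020-09-17 sits alone in group 5, which is why the conditional aggregation returns NULL for the missing side.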