SQL Server - Aggregate data by minute over multiple days

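Context

I am using Microsoft SQL Server 2016.

There is a database table “Raw_data” that contains the status of each machine and the time that status started. There are several machines, and each machine writes its status to the database several times per minute.

To reduce the data volume, I am trying to aggregate the data into 1-minute chunks and store them for further analysis. Because of capacity limits, I want to run this transformation logic every few minutes (e.g. as a scheduled SQL Server Agent job), delete the raw data and keep only the aggregated data.

To simplify the example, let's assume “Raw_data” looks like this: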

╔════╦════════════╦════════╦═════════════════════╗
║ id ║ fk_machine ║ status ║     created_at      ║
╠════╬════════════╬════════╬═════════════════════╣
║  1 ║       2222 ║      0 ║ 2020-08-19 22:15:00 ║
║  2 ║       2222 ║      3 ║ 2020-08-19 22:15:30 ║
║  3 ║       2222 ║      5 ║ 2020-08-19 23:07:00 ║
║  4 ║       2222 ║      1 ║ 2020-08-20 00:20:00 ║
║  5 ║       2222 ║      0 ║ 2020-08-20 00:45:00 ║
║  6 ║       2222 ║      5 ║ 2020-08-20 02:20:00 ║
╚════╩════════════╩════════╩═════════════════════╝

There are also the database tables “Dim_date” and “Dim_time”, which look like this:

╔══════════╦══════════════╗
║ datekey  ║ date_iso8601 ║
╠══════════╬══════════════╣
║ 20200101 ║ 2020-01-01   ║
║ 20200102 ║ 2020-01-02   ║
║ ...      ║ ...          ║
║ 20351231 ║ 2035-12-31   ║
╚══════════╩══════════════╝

╔═════════╦══════════╦═════════════════╗
║ timekey ║ time_iso ║ min_lower_bound ║
╠═════════╬══════════╬═════════════════╣
║ 1       ║ 00:00:01 ║ 00:00:00        ║
║ 2       ║ 00:00:02 ║ 00:00:00        ║
║ ...     ║ ...      ║ ...             ║
║ 80345   ║ 08:03:45 ║ 08:03:00        ║
║ ...     ║ ...      ║ ...             ║
║ 134504  ║ 13:45:04 ║ 13:45:00        ║
║ 134505  ║ 13:45:05 ║ 13:45:00        ║
║ ...     ║ ...      ║ ...             ║
║ 235959  ║ 23:59:59 ║ 23:59:00        ║
╚═════════╩══════════╩═════════════════╝

The result should look like this:

╔══════════════╦═════════════════╦════════════╦════════╦═══════════════╗
║ date_iso8601 ║ min_lower_bound ║ fk_machine ║ status ║ total_seconds ║
╠══════════════╬═════════════════╬════════════╬════════╬═══════════════╣
║ 2020-08-19   ║ 22:15:00        ║ 2222       ║ 0      ║ 30            ║
║ 2020-08-19   ║ 22:15:00        ║ 2222       ║ 3      ║ 30            ║
║ 2020-08-19   ║ 22:16:00        ║ 2222       ║ 3      ║ 60            ║
║ 2020-08-19   ║ 22:17:00        ║ 2222       ║ 3      ║ 60            ║
║ ...          ║ ...             ║ ...        ║ ...    ║ ...           ║
║ 2020-08-19   ║ 23:06:00        ║ 2222       ║ 3      ║ 60            ║
║ 2020-08-19   ║ 23:07:00        ║ 2222       ║ 5      ║ 60            ║
║ 2020-08-19   ║ 23:08:00        ║ 2222       ║ 5      ║ 60            ║
║ ...          ║ ...             ║ ...        ║ ...    ║ ...           ║
║ 2020-08-20   ║ 00:19:00        ║ 2222       ║ 5      ║ 60            ║
║ 2020-08-20   ║ 00:20:00        ║ 2222       ║ 1      ║ 60            ║
║ 2020-08-20   ║ 00:21:00        ║ 2222       ║ 1      ║ 60            ║
║ ...          ║ ...             ║ ...        ║ ...    ║ ...           ║
║ 2020-08-20   ║ 00:44:00        ║ 2222       ║ 1      ║ 60            ║
║ 2020-08-20   ║ 00:45:00        ║ 2222       ║ 0      ║ 60            ║
╚══════════════╩═════════════════╩════════════╩════════╩═══════════════╝

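Attempt

To calculate the duration of each status per minute, I use a CTE and LEAD to get the start date and time of the next status from the database table, then join the dimension tables and aggregate the result.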

WITH CTE_MACHINE_STATES(START_DATEKEY, 
                        START_TIMEKEY, 
                        FK_MACHINE, 
                        STATUS, 
                        END_DATEKEY, 
                        END_TIMEKEY)
     AS (SELECT CAST(CONVERT(CHAR(8), CREATED_AT, 112) AS INT), -- ISO: yyyymmdd
                CONVERT(INT, REPLACE(CONVERT(CHAR(8), CREATED_AT, 108), ':', '')), -- hh:mi:ss -> hhmmss
                FK_MACHINE, 
                STATUS, 
                CAST(CONVERT(CHAR(8), LEAD(CREATED_AT, 1) OVER(PARTITION BY FK_MACHINE
                ORDER BY CREATED_AT), 112) AS INT),
                CONVERT(INT, REPLACE(CONVERT(CHAR(8), LEAD(CREATED_AT, 1) OVER(PARTITION BY FK_MACHINE
                ORDER BY CREATED_AT), 108), ':', ''))
         FROM RAW_DATA)
     SELECT DATE_ISO8601, 
            MIN_LOWER_BOUND, 
            FK_MACHINE, 
            STATUS, 
            SUM(1) AS TOTAL_SECONDS -- Duration
     FROM CTE_MACHINE_STATES
     CROSS JOIN DIM_DATE
     CROSS JOIN DIM_TIME
     WHERE TIMEKEY >= START_TIMEKEY AND 
           TIMEKEY < END_TIMEKEY AND 
           END_TIMEKEY IS NOT NULL AND -- last entry per machine and status
           DATEKEY BETWEEN START_DATEKEY AND END_DATEKEY
     GROUP BY FK_MACHINE, 
              STATUS, 
              DATE_ISO8601, 
              MIN_LOWER_BOUND
     ORDER BY DATE_ISO8601, 
              MIN_LOWER_BOUND;

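Problem

If a status lasts past midnight, it is not aggregated correctly. For example, the status with id = 3 in “Raw_data” starts at 23:07 and ends at 00:20 the next day. Here the start timekey (230700) is greater than the end timekey (2000), so no timekey can satisfy both TIMEKEY >= START_TIMEKEY and TIMEKEY < END_TIMEKEY, and the status is excluded from the result table. I have not yet found a way to change the join condition so that it includes such long-lasting statuses and still returns the expected result.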
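
The failing comparison can be reproduced outside the database. The following is only a minimal Python sketch of the key arithmetic, not part of the query: `timekey` is a hypothetical helper that imitates the hhmmss integer conversion from the CTE, and the list of keys assumes that Dim_time covers every second of the day.

```python
from datetime import datetime


def timekey(ts: datetime) -> int:
    """Mimic CONVERT(INT, REPLACE(CONVERT(CHAR(8), ts, 108), ':', ''))."""
    return int(ts.strftime("%H%M%S"))


start_key = timekey(datetime(2020, 8, 19, 23, 7, 0))  # id = 3 starts
end_key = timekey(datetime(2020, 8, 20, 0, 20, 0))    # id = 4 starts, so id = 3 ends

# Every hhmmss value of a day as an integer, standing in for the Dim_time keys
# (assumed here to cover 00:00:00 through 23:59:59).
all_timekeys = [h * 10000 + m * 100 + s
                for h in range(24) for m in range(60) for s in range(60)]

# Same filter as in the WHERE clause: TIMEKEY >= START_TIMEKEY AND TIMEKEY < END_TIMEKEY
matching = [k for k in all_timekeys if start_key <= k < end_key]
print(start_key, end_key, len(matching))  # 230700 2000 0
```

Because the time key wraps around at midnight while the date key does not, the time filter alone already removes every row for this status.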

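PS: As I wrote above, status updates usually arrive every few seconds, so the problem only occurs in edge cases, e.g. when a machine is switched off.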


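Solution

Unfortunately, I did not receive an answer on how to get the expected result using the date and time dimension tables. However, dnoeth's approach with a recursive CTE works well, so I adopted it: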

WITH cte_outer AS (
    SELECT fk_machine,
           status,
           created_at,
           DATEADD(minute, DATEDIFF(minute, '2000', created_at), '2000') AS min_lower_bound, --truncates seconds from start time
           LEAD(created_at) OVER(PARTITION BY fk_machine ORDER BY created_at) AS end_time
    FROM raw_data
),
    cte_recursive AS (
        SELECT fk_machine,
               status,
               min_lower_bound,
               end_time,
               CASE
                 WHEN end_time > DATEADD(minute, 1, min_lower_bound)
                 THEN DATEDIFF(s, created_at, DATEADD(minute, 1, min_lower_bound))
                 ELSE DATEDIFF(s, created_at, end_time)
               END AS total_seconds
        FROM cte_outer

        UNION ALL

        SELECT fk_machine,
               status,
               DATEADD(minute, 1, min_lower_bound), -- next time segment (minute)
               end_time,
               CASE
                 WHEN end_time >= DATEADD(minute, 2, min_lower_bound)
                 THEN 60
                 ELSE DATEDIFF(s, DATEADD(minute, 1, min_lower_bound), end_time)
               END
        FROM cte_recursive
        WHERE end_time > DATEADD(minute, 1, min_lower_bound)
)
SELECT min_lower_bound,
       fk_machine,
       status,
       total_seconds
FROM cte_recursive
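ORDER BY  fk_machine, 
          min_lower_bound
OPTION (MAXRECURSION 0); -- lift the default limit of 100 recursion levels, which a status lasting longer than about 100 minutes would exceed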

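For something like this, combining the keys into a single datetime is not as expensive as it might seem. You can then call DATEDIFF() and check whether the result is positive, negative, or compare its absolute value. I have run something similar that converted instantaneous data into per-minute aggregates spanning decades, and DATEDIFF really made a difference. That said, you would be much better off simply extracting the raw data and doing the calculation in a language with a good datetime library. SQL is always the answer, until it isn't.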
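
As a rough illustration of that last suggestion, here is a minimal Python sketch that splits each status into per-minute pieces using only the standard library. The function name `aggregate_by_minute` and the tuple layout of `rows` are assumptions made for this example; the sample rows are the ones from “Raw_data” in the question. Like the LEAD-based queries, it skips the last status of each machine because its end time is not yet known.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def aggregate_by_minute(rows):
    """rows: iterable of (fk_machine, status, created_at) in any order.
    Returns {(minute_start, fk_machine, status): total_seconds}."""
    by_machine = defaultdict(list)
    for machine, status, created_at in rows:
        by_machine[machine].append((created_at, status))

    totals = defaultdict(int)
    one_minute = timedelta(minutes=1)
    for machine, events in by_machine.items():
        events.sort()
        # Pair each status with the start of the next one (the role of LEAD);
        # the last status per machine has no end yet and is skipped.
        for (start, status), (end, _) in zip(events, events[1:]):
            minute = start.replace(second=0, microsecond=0)
            while minute < end:
                seg_start = max(start, minute)
                seg_end = min(end, minute + one_minute)
                seconds = int((seg_end - seg_start).total_seconds())
                totals[(minute, machine, status)] += seconds
                minute += one_minute
    return dict(totals)


rows = [
    (2222, 0, datetime(2020, 8, 19, 22, 15, 0)),
    (2222, 3, datetime(2020, 8, 19, 22, 15, 30)),
    (2222, 5, datetime(2020, 8, 19, 23, 7, 0)),
    (2222, 1, datetime(2020, 8, 20, 0, 20, 0)),
    (2222, 0, datetime(2020, 8, 20, 0, 45, 0)),
    (2222, 5, datetime(2020, 8, 20, 2, 20, 0)),
]
result = aggregate_by_minute(rows)
print(result[(datetime(2020, 8, 19, 22, 15), 2222, 3)])  # 30
print(result[(datetime(2020, 8, 20, 0, 19), 2222, 5)])   # 60
```

Because the loop works on full datetimes, a status that crosses midnight needs no special handling.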

One likely cause of the problem here is the following condition:

WHERE TIMEKEY >= START_TIMEKEY AND 
              TIMEKEY < END_TIMEKEY AND 
              END_TIMEKEY IS NOT NULL AND 
              DATEKEY BETWEEN START_DATEKEY AND END_DATEKEY

If the date and the time were not kept separate, you could write:

WHERE DateTimeKey >= START_DateTimeKey AND 
              DateTimeKey < END_DateTimeKey AND 
              END_DateTimeKey IS NOT NULL
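
To see why a combined key behaves differently, here is a small Python sketch that applies the same half-open comparison to full datetimes. The per-second `candidates` generator is a stand-in assumed for this example (the original tables have no combined key); the two timestamps are the start of id = 3 and id = 4 from the question.

```python
from datetime import datetime, timedelta

start = datetime(2020, 8, 19, 23, 7, 0)  # START_DateTimeKey of the status with id = 3
end = datetime(2020, 8, 20, 0, 20, 0)    # END_DateTimeKey, i.e. the start of id = 4

day0 = datetime(2020, 8, 19)
# One candidate DateTimeKey per second across both calendar days
candidates = (day0 + timedelta(seconds=s) for s in range(2 * 24 * 60 * 60))

# DateTimeKey >= START_DateTimeKey AND DateTimeKey < END_DateTimeKey
matched = sum(1 for t in candidates if start <= t < end)
print(matched)  # 4380 seconds, i.e. 73 minutes
```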

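If you are trying to aggregate by time values, it helps to eliminate the time key table altogether, since it may be another source of the problem. Replacing the time key table with recursion and a period duration may be a good idea. You also need to account for these conditions:

The end time of the event must always be after the start time of the aggregation period: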

DateDiff(second, Period_Start_Time, Event_End) > 0

The start time of the event must always be before the end time of the aggregation period:

DateDiff(second, Period_Start_Time, Event_Start) < @Period_Duration

There are several ways to distribute event data across time periods, but DATEDIFF also helps with a linear distribution.
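
One way to read the two conditions together with a linear distribution is to give each period exactly the number of seconds by which the event overlaps it. The following Python sketch shows that idea under this assumption; `overlap_seconds` and its parameter names are hypothetical and simply mirror Event_Start, Event_End, Period_Start_Time and @Period_Duration from above.

```python
from datetime import datetime, timedelta


def overlap_seconds(event_start, event_end, period_start, period_duration):
    """Seconds of the event that fall inside [period_start, period_start + duration)."""
    period_end = period_start + period_duration
    # The two conditions: the event ends after the period starts,
    # and the event starts before the period ends.
    if not (event_end > period_start and event_start < period_end):
        return 0
    clipped_start = max(event_start, period_start)
    clipped_end = min(event_end, period_end)
    return int((clipped_end - clipped_start).total_seconds())


minute = timedelta(seconds=60)
status3 = (datetime(2020, 8, 19, 22, 15, 30), datetime(2020, 8, 19, 23, 7, 0))
status5 = (datetime(2020, 8, 19, 23, 7, 0), datetime(2020, 8, 20, 0, 20, 0))

print(overlap_seconds(*status3, datetime(2020, 8, 19, 22, 15), minute))  # 30
print(overlap_seconds(*status5, datetime(2020, 8, 19, 23, 59), minute))  # 60
print(overlap_seconds(*status5, datetime(2020, 8, 20, 0, 20), minute))   # 0
```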

This is a use case for a recursive CTE that advances created_at by one minute with each recursion:

with cte as 
 (
   select fk_machine
     ,status  
     ,start_minute
     ,end_time
     ,case
        when end_time > dateadd(minute, 1,start_minute)
        then datediff(s, created_at, dateadd(minute, 1,start_minute)) 
        else datediff(s, created_at, end_time )
      end as seconds
   from
    (
      select fk_machine
        ,status
        ,created_at 
        ,dateadd(minute, datediff(minute, 0, created_at), 0) as start_minute
        ,lead(created_at)
         over (PARTITION BY fk_machine
               order by created_at) as end_time
      from tab
    ) as dt
 
   union all
 
   select fk_machine
     ,status
     ,dateadd(minute, 1,start_minute)
     ,end_time
     ,case
        when end_time >= dateadd(minute, 2,start_minute)
        then 60
        else datediff(s, dateadd(minute, 1,start_minute), end_time)
      end
    from cte
    where end_time > dateadd(minute, 1,start_minute)
 )
select * from cte
order by 1,3,4;
