How to extract time series data before and after a specific event using AWS Athena?

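I have a store holding a large amount of time series data, and I can extract the data with AWS Athena. However, I don't know how to extract the time series data before and after a specific event with AWS Athena.

What query would achieve this?

Does anyone have ideas or sample queries for Athena?

For example, I have the following input data.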

<input data>

id  | timestamp                 | value | level |
-------------------------------------------------
1   | 2021-04-01T12:00:00+00:00 | 100.0 | 1     |
2   | 2021-04-01T12:00:10+00:00 | 98.0  | 1     |
3   | 2021-04-01T12:00:20+00:00 | 99.5  | 1     |
...
58  | 2021-04-01T12:09:40+00:00 | 98.2  | 1     |
59  | 2021-04-01T12:09:50+00:00 | 95.3  | 1     |
60  | 2021-04-01T12:10:00+00:00 | 99.2  | 1     |
61  | 2021-04-01T12:10:10+00:00 | 97.6  | 2     |
62  | 2021-04-01T12:10:20+00:00 | 98.6  | 2     |
63  | 2021-04-01T12:10:30+00:00 | 98.3  | 2     |
64  | 2021-04-01T12:10:40+00:00 | 98.1  | 2     |
...
100 | 2021-04-01T12:16:40+00:00 | 97.6  | 2     |

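What I want to do is extract the records within 30 seconds before and after the level 1->2 change event.

In this case, the expected output is the data from id:58 to id:64.

You can use the 'lag' window function to find the timestamps at which the level changes: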

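-- Rows whose level differs from the level of the previous row (by timestamp).
-- The first row has prev_level = NULL, so it is never reported as a change.
SELECT *
FROM (SELECT timestamp,
             lag(level) OVER (ORDER BY timestamp) AS prev_level,
             level
      FROM dataset)
WHERE prev_level != level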

Then use those timestamps to filter the dataset. For example, something like this:

WITH dataset(id, timestamp, value, level) AS (
    VALUES
        ('1',   timestamp '2021-04-01 12:00:00+00:00', 100.0, 1),
        ('2',   timestamp '2021-04-01 12:00:10+00:00', 98.0,  1),
        ('3',   timestamp '2021-04-01 12:00:20+00:00', 99.5,  1),
        ('58',  timestamp '2021-04-01 12:09:40+00:00', 98.2,  1),
        ('59',  timestamp '2021-04-01 12:09:50+00:00', 95.3,  1),
        ('60',  timestamp '2021-04-01 12:10:00+00:00', 99.2,  1),
        ('61',  timestamp '2021-04-01 12:10:10+00:00', 97.6,  2),
        ('62',  timestamp '2021-04-01 12:10:20+00:00', 98.6,  2),
        ('63',  timestamp '2021-04-01 12:10:30+00:00', 98.3,  2),
        ('64',  timestamp '2021-04-01 12:10:40+00:00', 98.1,  2),
        ('100', timestamp '2021-04-01 12:16:40+00:00', 97.6,  2)
)

SELECT *
FROM dataset o
WHERE EXISTS(
              SELECT *
              FROM (SELECT *
                    FROM (SELECT timestamp,
                                 lag(level) OVER (order by timestamp) AS prev_level,
                                 level
                          FROM dataset)
                    WHERE prev_level != level)
              WHERE (o.level = level AND o.timestamp - timestamp < interval '30' second)
                 OR (o.level = prev_level AND timestamp - o.timestamp < interval '30' second)
          )

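Output (the comparisons use a strict <, so the rows exactly 30 seconds from the change, ids 58 and 64, are excluded; use <= to include them):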

id timestamp value level
59 2021-04-01 12:09:50.000 UTC 95.3 1
60 2021-04-01 12:10:00.000 UTC 99.2 1
61 2021-04-01 12:10:10.000 UTC 97.6 2
62 2021-04-01 12:10:20.000 UTC 98.6 2
63 2021-04-01 12:10:30.000 UTC 98.3 2
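
If the rows exactly 30 seconds away should be included, as in the expected output in the question (id:58 to id:64), one possible variant is sketched below. It is not the query above but a rearrangement under a few assumptions: the CTE name changes and the column alias change_ts are made up for illustration, only 1->2 transitions are treated as the event, and rows are selected purely by time distance from the change (inclusive on both sides) rather than by level. On the sample rows it should select ids 58 through 64.

```sql
-- Sketch only: inclusive +/- 30 second window around each 1 -> 2 level change.
-- "changes" and "change_ts" are illustrative names, not part of the original answer.
WITH dataset(id, timestamp, value, level) AS (
    VALUES
        ('1',   timestamp '2021-04-01 12:00:00+00:00', 100.0, 1),
        ('58',  timestamp '2021-04-01 12:09:40+00:00', 98.2,  1),
        ('59',  timestamp '2021-04-01 12:09:50+00:00', 95.3,  1),
        ('60',  timestamp '2021-04-01 12:10:00+00:00', 99.2,  1),
        ('61',  timestamp '2021-04-01 12:10:10+00:00', 97.6,  2),
        ('62',  timestamp '2021-04-01 12:10:20+00:00', 98.6,  2),
        ('63',  timestamp '2021-04-01 12:10:30+00:00', 98.3,  2),
        ('64',  timestamp '2021-04-01 12:10:40+00:00', 98.1,  2),
        ('100', timestamp '2021-04-01 12:16:40+00:00', 97.6,  2)
),
changes AS (
    -- Timestamp of the first row at level 2 that directly follows a level 1 row.
    SELECT timestamp AS change_ts
    FROM (SELECT timestamp,
                 level,
                 lag(level) OVER (ORDER BY timestamp) AS prev_level
          FROM dataset)
    WHERE prev_level = 1 AND level = 2
)
SELECT d.*
FROM dataset d
WHERE EXISTS (
    -- Keep a row if it lies within 30 seconds of any change, boundaries included.
    SELECT 1
    FROM changes c
    WHERE d.timestamp BETWEEN c.change_ts - interval '30' second
                          AND c.change_ts + interval '30' second
)
ORDER BY d.timestamp
```

Using EXISTS rather than a plain join keeps each row at most once even when two change events are less than 60 seconds apart and their windows overlap.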