How to extract time series data before and after a specific event using AWS Athena?
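I have a store holding a large amount of time series data, and I can extract data from it through AWS Athena.
However, I don't know how to extract the time series data before and after a specific event using AWS Athena.
What query would achieve this?
Does anyone have ideas or example queries for Athena?
For example, I have the following input data.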
<input data>
id | timestamp | value | level |
---------------------------------------------
1 | 2021-04-01T12:00:00+00:00 | 100.0 | 1 |
2 | 2021-04-01T12:00:10+00:00 | 98.0 | 1 |
3 | 2021-04-01T12:00:20+00:00 | 99.5 | 1 |
...
58 | 2021-04-01T12:09:40+00:00 | 98.2 | 1 |
59 | 2021-04-01T12:09:50+00:00 | 95.3 | 1 |
60 | 2021-04-01T12:10:00+00:00 | 99.2 | 1 |
61 | 2021-04-01T12:10:10+00:00 | 97.6 | 2 |
62 | 2021-04-01T12:10:20+00:00 | 98.6 | 2 |
63 | 2021-04-01T12:10:30+00:00 | 98.3 | 2 |
64 | 2021-04-01T12:10:40+00:00 | 98.1 | 2 |
...
100 | 2021-04-01T12:16:40+00:00 | 97.6 | 2 |
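What I want to do is extract the records within 30 seconds before and after the event where level changes 1->2.
In this case, the expected output is the data from id:58 to id:64.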
You can use the 'lag' window function to find the timestamps at which the level changes:
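SELECT *
FROM (SELECT *
      FROM (SELECT timestamp,
                   -- level of the previous row in time order (NULL for the first row)
                   lag(level) OVER (ORDER BY timestamp) AS prev_level,
                   level
            FROM dataset)
      -- keep only the rows where the level differs from the previous row
      WHERE prev_level != level)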
Then use those timestamps to filter the dataset. For example, something like this:
WITH dataset(id, timestamp, value, level) AS (
    VALUES
        ('1', timestamp '2021-04-01 12:00:00+00:00', 100.0, 1),
        ('2', timestamp '2021-04-01 12:00:10+00:00', 98.0, 1),
        ('3', timestamp '2021-04-01 12:00:20+00:00', 99.5, 1),
        ('58', timestamp '2021-04-01 12:09:40+00:00', 98.2, 1),
        ('59', timestamp '2021-04-01 12:09:50+00:00', 95.3, 1),
        ('60', timestamp '2021-04-01 12:10:00+00:00', 99.2, 1),
        ('61', timestamp '2021-04-01 12:10:10+00:00', 97.6, 2),
        ('62', timestamp '2021-04-01 12:10:20+00:00', 98.6, 2),
        ('63', timestamp '2021-04-01 12:10:30+00:00', 98.3, 2),
        ('64', timestamp '2021-04-01 12:10:40+00:00', 98.1, 2),
        ('100', timestamp '2021-04-01 12:16:40+00:00', 97.6, 2)
)
SELECT *
FROM dataset o
WHERE EXISTS (
    SELECT *
    FROM (SELECT *
          FROM (SELECT timestamp,
                       lag(level) OVER (ORDER BY timestamp) AS prev_level,
                       level
                FROM dataset)
          -- keep only the rows where the level changed
          WHERE prev_level != level)
    -- unqualified columns refer to the change row; o.* is the candidate row
    WHERE (o.level = level AND o.timestamp - timestamp < interval '30' second)
       OR (o.level = prev_level AND timestamp - o.timestamp < interval '30' second)
)
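Output (ids 58 and 64 lie exactly 30 seconds from the change, so the strict '<' comparisons leave them out):
id | timestamp | value | level |
---|---|---|---|
59 | 2021-04-01 12:09:50.000 UTC | 95.3 | 1 |
60 | 2021-04-01 12:10:00.000 UTC | 99.2 | 1 |
61 | 2021-04-01 12:10:10.000 UTC | 97.6 | 2 |
62 | 2021-04-01 12:10:20.000 UTC | 98.6 | 2 |
63 | 2021-04-01 12:10:30.000 UTC | 98.3 | 2 |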
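If you need the boundary rows as well, and only care about the 1->2 transition, a variant along these lines might be closer to what the question asks for. Treat it as a sketch under assumptions rather than a tested solution: the CTE name changes and the column changed_at are names made up here for illustration, the event time is assumed to be the timestamp of the first level-2 row, and the window is inclusive on both sides.

```sql
WITH dataset(id, timestamp, value, level) AS (
    VALUES
        ('1', timestamp '2021-04-01 12:00:00+00:00', 100.0, 1),
        ('2', timestamp '2021-04-01 12:00:10+00:00', 98.0, 1),
        ('3', timestamp '2021-04-01 12:00:20+00:00', 99.5, 1),
        ('58', timestamp '2021-04-01 12:09:40+00:00', 98.2, 1),
        ('59', timestamp '2021-04-01 12:09:50+00:00', 95.3, 1),
        ('60', timestamp '2021-04-01 12:10:00+00:00', 99.2, 1),
        ('61', timestamp '2021-04-01 12:10:10+00:00', 97.6, 2),
        ('62', timestamp '2021-04-01 12:10:20+00:00', 98.6, 2),
        ('63', timestamp '2021-04-01 12:10:30+00:00', 98.3, 2),
        ('64', timestamp '2021-04-01 12:10:40+00:00', 98.1, 2),
        ('100', timestamp '2021-04-01 12:16:40+00:00', 97.6, 2)
),
-- hypothetical helper CTE: one row per 1->2 transition
changes AS (
    SELECT timestamp AS changed_at
    FROM (SELECT timestamp,
                 level,
                 lag(level) OVER (ORDER BY timestamp) AS prev_level
          FROM dataset)
    -- the first row has a NULL prev_level, so it is filtered out here
    WHERE prev_level = 1 AND level = 2
)
SELECT *
FROM dataset o
WHERE EXISTS (
    SELECT 1
    FROM changes c
    -- BETWEEN is inclusive, so rows exactly 30 seconds away are kept
    WHERE o.timestamp BETWEEN c.changed_at - interval '30' second
                          AND c.changed_at + interval '30' second
)
ORDER BY o.timestamp
```

On the sample data the only transition is at 12:10:10, so the window runs from 12:09:40 to 12:10:40 and the query should return ids 58 through 64. Filtering purely on the time window, rather than also comparing levels, keeps rows from unrelated transitions out of the result when the level changes more than once.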