Get running count and sum of values of all rows that are started but not yet finished, for each second
I have some event data that looks like this:
| time | id | status | value |
|-------------------------|----|----------|-------|
| 2020-08-26T21:29:01.000 | 2 | started | 8 |
| 2020-08-26T21:29:01.000 | 3 | started | 4 |
| 2020-08-26T21:29:02.000 | 2 | finished | 8 |
| 2020-08-26T21:29:03.000 | 4 | started | 12 |
| 2020-08-26T21:29:04.000 | 5 | started | 2 |
| 2020-08-26T21:29:05.000 | 6 | started | 24 |
| 2020-08-26T21:29:06.000 | 4 | finished | 12 |
| 2020-08-26T21:29:06.000 | 3 | finished | 4 |
| 2020-08-26T21:29:07.000 | 1 | finished | 1 |
| 2020-08-26T21:29:10.000 | 7 | started | 4 |
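Note that event logging began after some things had already started, and some events have not finished yet.
I am trying to get a running count of rows and a running sum of their values for each second.
Whenever I think of a running count I think of window queries, but I am struggling to work out how to get my expected output from this data.
Ideally, I would like the following result: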
| time | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 1 | 1 |
| 2020-08-26T21:29:01.000 | 3 | 13 |
| 2020-08-26T21:29:02.000 | 2 | 5 |
| 2020-08-26T21:29:03.000 | 3 | 17 |
| 2020-08-26T21:29:04.000 | 4 | 19 |
| 2020-08-26T21:29:05.000 | 5 | 43 |
| 2020-08-26T21:29:06.000 | 3 | 29 |
| 2020-08-26T21:29:07.000 | 2 | 28 |
| 2020-08-26T21:29:08.000 | 2 | 28 |
| 2020-08-26T21:29:09.000 | 2 | 28 |
| 2020-08-26T21:29:10.000 | 3 | 32 |
| 2020-08-26T21:29:11.000 | 3 | 32 |
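I would also be happy with an answer that does not account for id 1, which was already running before event logging began. That would produce the following result: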
| time | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 0 | 0 |
| 2020-08-26T21:29:01.000 | 2 | 12 |
| 2020-08-26T21:29:02.000 | 1 | 4 |
| 2020-08-26T21:29:03.000 | 2 | 16 |
| 2020-08-26T21:29:04.000 | 3 | 18 |
| 2020-08-26T21:29:05.000 | 4 | 42 |
| 2020-08-26T21:29:06.000 | 2 | 28 |
| 2020-08-26T21:29:07.000 | 2 | 28 |
| 2020-08-26T21:29:08.000 | 2 | 28 |
| 2020-08-26T21:29:09.000 | 2 | 28 |
| 2020-08-26T21:29:10.000 | 3 | 32 |
| 2020-08-26T21:29:11.000 | 3 | 32 |
Working in Athena/Presto, where I could not use a full join, I was able to get the start and stop time of each id with the following query (also on SQL Fiddle):
WITH started AS (
SELECT *
FROM foo
WHERE status = 'started'
), finished AS (
SELECT *
FROM foo
WHERE status = 'finished'
)
SELECT started.time AS started_time, finished.time AS finished_time, started.id, started.value
FROM started LEFT JOIN finished ON started.id = finished.id
I think you want a cumulative conditional sum:
select time,
sum(sum(case when status = 'started' then 1
when status = 'finished' then -1
end)
) over (order by time) as running_count,
sum(sum(case when status = 'started' then value
when status = 'finished' then - value
end)
) over (order by time) as running_value
from foo
group by time
The sum() calls need to be nested: the inner one is the aggregate for the GROUP BY, and the outer one is the window function.
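The query above returns one row per timestamp that has at least one event, so seconds without events (such as 21:29:08) do not appear. As a rough illustration of the same cumulative conditional sum with the gaps filled in, here is a minimal Python sketch over the sample rows. The function name `running_totals` and its parameters are hypothetical, and it assumes the variant that ignores ids whose start was never logged (id 1). The query as posted does not filter those out, so it would subtract id 1's finish at 21:29:07.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Sample rows from the question: (time, id, status, value).
EVENTS = [
    ("2020-08-26T21:29:01", 2, "started", 8),
    ("2020-08-26T21:29:01", 3, "started", 4),
    ("2020-08-26T21:29:02", 2, "finished", 8),
    ("2020-08-26T21:29:03", 4, "started", 12),
    ("2020-08-26T21:29:04", 5, "started", 2),
    ("2020-08-26T21:29:05", 6, "started", 24),
    ("2020-08-26T21:29:06", 4, "finished", 12),
    ("2020-08-26T21:29:06", 3, "finished", 4),
    ("2020-08-26T21:29:07", 1, "finished", 1),
    ("2020-08-26T21:29:10", 7, "started", 4),
]


def running_totals(events, start, end):
    """Return (time, running_count, running_value) for every second in [start, end]."""
    # Assumption: a 'finished' row whose id has no 'started' row is ignored.
    started_ids = {id_ for _, id_, status, _ in events if status == "started"}

    # Inner aggregate: net change in count and value per timestamp
    # (+1/+value for 'started', -1/-value for 'finished').
    delta = defaultdict(lambda: [0, 0])
    for time, id_, status, value in events:
        ts = datetime.fromisoformat(time)
        if status == "started":
            delta[ts][0] += 1
            delta[ts][1] += value
        elif status == "finished" and id_ in started_ids:
            delta[ts][0] -= 1
            delta[ts][1] -= value

    # Outer running sum, stepping one second at a time so that seconds
    # without events carry the previous totals forward.
    rows = []
    count = total = 0
    ts = start
    while ts <= end:
        d_count, d_value = delta.get(ts, (0, 0))
        count += d_count
        total += d_value
        rows.append((ts.isoformat(), count, total))
        ts += timedelta(seconds=1)
    return rows


if __name__ == "__main__":
    first = datetime(2020, 8, 26, 21, 29, 0)
    last = datetime(2020, 8, 26, 21, 29, 11)
    for row in running_totals(EVENTS, first, last):
        print(row)
```

For the sample rows this gives a count of 2 and a sum of 26 at 21:29:06 (ids 5 and 6 are still running), which is 2 less than the 28 listed in the question's expected table. The counts agree with that table.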