Get running count and sum of values of all rows that are started but not yet finished, for each second

I have some event data that looks like this:

| time                    | id | status   | value |
|-------------------------|----|----------|-------|
| 2020-08-26T21:29:01.000 | 2  | started  | 8     |
| 2020-08-26T21:29:01.000 | 3  | started  | 4     |
| 2020-08-26T21:29:02.000 | 2  | finished | 8     |
| 2020-08-26T21:29:03.000 | 4  | started  | 12    |
| 2020-08-26T21:29:04.000 | 5  | started  | 2     |
| 2020-08-26T21:29:05.000 | 6  | started  | 24    |
| 2020-08-26T21:29:06.000 | 4  | finished | 12    |
| 2020-08-26T21:29:06.000 | 3  | finished | 4     |
| 2020-08-26T21:29:07.000 | 1  | finished | 1     |
| 2020-08-26T21:29:10.000 | 7  | started  | 4     |

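Note that event recording began after some things had already started, and that some events have not finished yet.

I'm trying to get, for each second, the running count of rows and the running sum of their values.

As soon as I think of running counts I think of window queries, but I'm struggling to work out how to get my expected output from this data.

Ideally, I'd like the following result: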

| time                    | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 1     | 1          |
| 2020-08-26T21:29:01.000 | 3     | 13         |
| 2020-08-26T21:29:02.000 | 2     | 5          |
| 2020-08-26T21:29:03.000 | 3     | 17         |
| 2020-08-26T21:29:04.000 | 4     | 19         |
| 2020-08-26T21:29:05.000 | 5     | 43         |
| 2020-08-26T21:29:06.000 | 3     | 27         |
| 2020-08-26T21:29:07.000 | 2     | 26         |
| 2020-08-26T21:29:08.000 | 2     | 26         |
| 2020-08-26T21:29:09.000 | 2     | 26         |
| 2020-08-26T21:29:10.000 | 3     | 30         |
| 2020-08-26T21:29:11.000 | 3     | 30         |

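I'd also be happy with an answer that ignores id 1, which was already running before event recording began. That would produce the following result: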

| time                    | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 0     | 0          |
| 2020-08-26T21:29:01.000 | 2     | 12         |
| 2020-08-26T21:29:02.000 | 1     | 4          |
| 2020-08-26T21:29:03.000 | 2     | 16         |
| 2020-08-26T21:29:04.000 | 3     | 18         |
| 2020-08-26T21:29:05.000 | 4     | 42         |
| 2020-08-26T21:29:06.000 | 2     | 26         |
| 2020-08-26T21:29:07.000 | 2     | 26         |
| 2020-08-26T21:29:08.000 | 2     | 26         |
| 2020-08-26T21:29:09.000 | 2     | 26         |
| 2020-08-26T21:29:10.000 | 3     | 30         |
| 2020-08-26T21:29:11.000 | 3     | 30         |
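
To pin down what "running" means in these tables, here is a minimal Python sketch that recomputes the per-second figures by brute force from the sample rows. It is only an illustration of the definition, not a proposed Athena solution. The names `EVENTS`, `running_per_second` and `include_unstarted` are made up for this example, and times are written as second offsets within 21:29 for brevity.

```python
# Sample rows from the question as (second within 21:29, id, status, value).
EVENTS = [
    (1, 2, "started", 8), (1, 3, "started", 4), (2, 2, "finished", 8),
    (3, 4, "started", 12), (4, 5, "started", 2), (5, 6, "started", 24),
    (6, 4, "finished", 12), (6, 3, "finished", 4), (7, 1, "finished", 1),
    (10, 7, "started", 4),
]


def running_per_second(events, first, last, include_unstarted=False):
    """Return [(second, count, sum_values)] for every second in [first, last].

    An id counts as running at a given second if it started at or before
    that second and has no 'finished' row at or before it.
    """
    starts = {i: (t, v) for t, i, s, v in events if s == "started"}
    ends = {i: (t, v) for t, i, s, v in events if s == "finished"}
    if include_unstarted:
        # Assumption: an id with only a 'finished' row was running since `first`.
        for i, (_, v) in ends.items():
            starts.setdefault(i, (first, v))
    rows = []
    for sec in range(first, last + 1):
        live = [
            v
            for i, (t0, v) in starts.items()
            if t0 <= sec and (i not in ends or ends[i][0] > sec)
        ]
        rows.append((sec, len(live), sum(live)))
    return rows


for row in running_per_second(EVENTS, 0, 11):
    print(row)
```

Calling it with `include_unstarted=True` treats id 1 as running from the first second, which corresponds to the first of the two tables; the default corresponds to the second.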

Since Athena/Presto doesn't support full joins, I've been able to get the start and stop times for each id with the following query (also on SQL Fiddle):

    WITH started AS (
      SELECT *
      FROM foo
      WHERE status = 'started'
    ), finished AS (
      SELECT *
      FROM foo
      WHERE status = 'finished'
    )
    SELECT started.time AS started_time, finished.time AS finished_time, started.id, started.value
    FROM started LEFT JOIN finished ON started.id = finished.id
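
For readers who want to see what that join returns on the sample data, here is a rough Python equivalent. It is a sketch under assumptions: `EVENTS` and `start_finish_pairs` are hypothetical names, times are second offsets within 21:29, and each id is assumed to have at most one `finished` row (true for the sample, not guaranteed in general).

```python
# Sample rows from the question as (second within 21:29, id, status, value).
EVENTS = [
    (1, 2, "started", 8), (1, 3, "started", 4), (2, 2, "finished", 8),
    (3, 4, "started", 12), (4, 5, "started", 2), (5, 6, "started", 24),
    (6, 4, "finished", 12), (6, 3, "finished", 4), (7, 1, "finished", 1),
    (10, 7, "started", 4),
]


def start_finish_pairs(events):
    """Emulate `started LEFT JOIN finished ON started.id = finished.id`.

    Returns (started_time, finished_time, id, value) tuples, with
    finished_time set to None where no 'finished' row exists yet.
    """
    finished_time = {i: t for t, i, status, _ in events if status == "finished"}
    return [
        (t, finished_time.get(i), i, value)
        for t, i, status, value in events
        if status == "started"
    ]


for row in start_finish_pairs(EVENTS):
    print(row)
```

Because the join is driven by the `started` side, id 1 (which only has a `finished` row) does not appear in the result, while ids 5, 6 and 7 appear with no finish time.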

I think you want a cumulative conditional sum:

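    select time,
           -- inner sum(): net change in running rows at this timestamp;
           -- outer sum() over (...): running total of those changes
           sum(sum(case when status = 'started' then 1
                        when status = 'finished' then -1
                   end)
              ) over (order by time) as running_count,
           -- same pattern, weighted by value
           sum(sum(case when status = 'started' then value
                        when status = 'finished' then -value
                   end)
              ) over (order by time) as running_value
    from foo
    group by time
    order by time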

The sum() calls need to be nested because one is needed for the window function and one for the aggregation.
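
To make the two levels of sum() concrete, here is a small Python sketch of the same idea: first collapse the rows into one net change per timestamp (the aggregation), then take a running total over the sorted timestamps (the window). It only mirrors the logic and is not something Athena runs; `EVENTS`, `SIGN` and `running_totals` are hypothetical names, and times are second offsets within 21:29.

```python
from collections import defaultdict
from itertools import accumulate

# Sample rows from the question as (second within 21:29, id, status, value).
EVENTS = [
    (1, 2, "started", 8), (1, 3, "started", 4), (2, 2, "finished", 8),
    (3, 4, "started", 12), (4, 5, "started", 2), (5, 6, "started", 24),
    (6, 4, "finished", 12), (6, 3, "finished", 4), (7, 1, "finished", 1),
    (10, 7, "started", 4),
]

# +1 for a start, -1 for a finish; any other status contributes nothing.
SIGN = {"started": 1, "finished": -1}


def running_totals(events):
    """Return [(time, running_count, running_value)] per distinct time."""
    count_delta = defaultdict(int)
    value_delta = defaultdict(int)
    for t, _id, status, value in events:
        sign = SIGN.get(status, 0)
        # Inner sum(...) with GROUP BY time: one net change per timestamp.
        count_delta[t] += sign
        value_delta[t] += sign * value
    times = sorted(count_delta)
    # Outer sum(...) OVER (ORDER BY time): running total of the net changes.
    counts = accumulate(count_delta[t] for t in times)
    values = accumulate(value_delta[t] for t in times)
    return list(zip(times, counts, values))


for row in running_totals(EVENTS):
    print(row)
```

Two things show up when this runs on the sample data. Only timestamps that have at least one row are returned, so seconds with no events would still need to be filled in separately. And the `finished` row for id 1 has no matching start, so it pulls the totals down from second 7 onward unless such ids are filtered out or the totals are seeded with what was already running.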