Using Timescale to find the latest value per interval

Question

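I have time series data with millisecond precision. Some rows may share exactly the same timestamp, so they can be ordered by the database ID column to determine which one is the latest.

I am trying to use Timescale to get the latest value per second. Here is an example of the data I am looking at: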

time                    | db_id | value  |
2020-01-01 08:39:23.293 | 4460  | 136.01 |
2020-01-01 08:39:23.393 | 4461  | 197.95 |
2020-01-01 08:40:38.973 | 4462  |  57.95 |
2020-01-01 08:43:01.223 | 4463  |    156 |
2020-01-01 08:43:26.577 | 4464  | 253.43 |
2020-01-01 08:43:26.577 | 4465  |  53.68 |
2020-01-01 08:43:26.577 | 4466  | 160.00 |

When getting the latest price per second, my result should look like this:

time                 value
2020-01-01 08:39:23 | 197.95 |
2020-01-01 08:39:24 | 197.95 |
.
.
.
2020-01-01 08:40:37 | 197.95 |
2020-01-01 08:40:38 | 57.95  |
2020-01-01 08:40:39 | 57.95  |
.
.
.
2020-01-01 08:43:25 | 57.95  | 
2020-01-01 08:43:26 | 160.00 |  
2020-01-01 08:43:27 | 160.00 |
.
.
.

I have managed to get the latest result per second using Timescale's time_bucket:

SELECT last(value, db_id), time_bucket('1 seconds', time) AS per_second FROM timeseries GROUP BY per_second ORDER BY per_second DESC;

But it leaves gaps in the time column:

time                 value
2020-01-01 08:39:23 | 197.95 |
2020-01-01 08:40:38 | 57.95  | 
2020-01-01 08:43:26 | 160.00 |

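The solution I came up with is to create a table with a timestamp for every second and null values, migrate the data over from the previous result table, and then replace the nulls with the last value that occurred, but that seems like a lot of intermediate steps.

I would like to know whether there is a better way to find the "latest value" per second, per minute, per hour, and so on. I originally tried to solve this in Python, since it looked like a simple problem, but it took a lot of computation time.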

Answer 1

I found a good working solution to my problem. It involves four main steps:

Getting the latest value

    select 
        time_bucket('1 second', time + '1 second') as interval,
        last(val, db_id) as last_value
    from table
    where time  > <date_start> and time < <date_end>
    group by interval
    order by interval;

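This produces a table with the latest values. The last aggregate also takes a second column (here db_id) to order by, in case another level of sorting is needed. For example: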

time                 last_value
2020-01-01 08:39:23 | 197.95 |
2020-01-01 08:40:38 | 57.95  | 
2020-01-01 08:43:26 | 160.00 |

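Note that I use + '1 second' to shift the time forward by one second, because for any given second I only want the data from before it. Without the shift, data recorded during that second would count toward its last price.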
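The effect of the one-second shift and of ordering by db_id can be checked outside the database. The following is a minimal Python sketch, not TimescaleDB code: bucket_with_shift and last_per_bucket are hypothetical helpers that imitate time_bucket('1 second', time + '1 second') and last(val, db_id), and the rows are made-up sample data.

```python
from datetime import datetime, timedelta


def bucket_with_shift(ts, shift=timedelta(seconds=1)):
    """Floor (ts + shift) to a whole second, imitating
    time_bucket('1 second', time + '1 second')."""
    return (ts + shift).replace(microsecond=0)


def last_per_bucket(rows):
    """For each bucket keep the value of the row with the highest db_id,
    imitating last(val, db_id) ... group by interval."""
    best = {}  # bucket -> (db_id, val)
    for ts, db_id, val in rows:
        bucket = bucket_with_shift(ts)
        if bucket not in best or db_id > best[bucket][0]:
            best[bucket] = (db_id, val)
    return {bucket: val for bucket, (_, val) in sorted(best.items())}


# Made-up rows: (time, db_id, val). The first two share a timestamp,
# so the higher db_id decides which one is the latest.
rows = [
    (datetime(2020, 1, 1, 0, 0, 0, 250000), 1, 15.90),
    (datetime(2020, 1, 1, 0, 0, 0, 250000), 2, 15.82),
    (datetime(2020, 1, 1, 0, 0, 9, 400000), 3, 15.72),
]

latest = last_per_bucket(rows)
for bucket, val in latest.items():
    print(bucket, val)
# 2020-01-01 00:00:01 15.82
# 2020-01-01 00:00:10 15.72
```

A row stamped 00:00:00.250 lands in the 00:00:01 bucket, so each bucket label reads as "the latest value known before this second".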

Creating a table with a timestamp for every second

    select 
        time_bucket_gapfill('1 second', time) as per_second
    from table
    where time  > <date_start> and time < <date_end>
    group by per_second
    order by per_second;

Here I generate a table in which every row is a per-second timestamp.

For example:

per_second
2020-01-01 00:00:00.000
2020-01-01 00:00:01.000
2020-01-01 00:00:02.000
2020-01-01 00:00:03.000
2020-01-01 00:00:04.000
2020-01-01 00:00:05.000
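As a rough picture of what this step yields, here is a small Python sketch that builds the same kind of per-second grid. per_second_grid is a hypothetical helper, and the boundary handling (start included, end excluded) is an assumption made for illustration rather than a statement about how time_bucket_gapfill treats the WHERE bounds.

```python
from datetime import datetime, timedelta


def per_second_grid(start, end, step=timedelta(seconds=1)):
    """Return every bucket timestamp t with floor(start) <= t < end."""
    grid = []
    t = start.replace(microsecond=0)  # align to a whole second
    while t < end:
        grid.append(t)
        t += step
    return grid


grid = per_second_grid(datetime(2020, 1, 1, 0, 0, 0),
                       datetime(2020, 1, 1, 0, 0, 6))
for t in grid:
    print(t)
```

With these bounds the sketch prints the six timestamps from 00:00:00 to 00:00:05, one row per second whether or not any data exists for it.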

Joining them together and adding a value_partition column

    select
        per_second,
        last_value,
        -- running count of non-null values: increments only on rows that carry a value
        sum(case when last_value is null then 0 else 1 end) over (order by per_second) as value_partition
    from
        (
            select
                time_bucket('1 second', time + '1 second') as interval,
                last(val, db_id) as last_value
            from table
            where time > <date_start> and time < <date_end>
            group by interval
        ) a
    right join
        (
            select
                time_bucket_gapfill('1 second', time) as per_second
            from table
            where time > <date_start> and time < <date_end>
            group by per_second
        ) b
    on a.interval = b.per_second

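Inspired by this answer, the goal is to have a counter (value_partition) that increments only when the value is not null. Every gap row therefore shares a partition number with the most recent row that had a value.

For example:

per_second              last_value   value_partition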
2020-01-01 00:00:00.000 NULL         0         
2020-01-01 00:00:01.000 15.82        1         
2020-01-01 00:00:02.000 NULL         1         
2020-01-01 00:00:03.000 NULL         1         
2020-01-01 00:00:04.000 NULL         1         
2020-01-01 00:00:05.000 NULL         1         
2020-01-01 00:00:06.000 NULL         1         
2020-01-01 00:00:07.000 NULL         1         
2020-01-01 00:00:08.000 NULL         1         
2020-01-01 00:00:09.000 NULL         1         
2020-01-01 00:00:10.000 15.72        2 
2020-01-01 00:00:10.000 14.67        3
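The counter is just a running sum of "is this value non-null" flags. A minimal Python sketch of that idea, using a made-up list of joined values in which None stands for a SQL NULL:

```python
from itertools import accumulate

# Hypothetical last_value column after the right join, ordered by per_second.
last_values = [None, 15.82, None, None, None, 15.72]

# Running count of non-null values, imitating
# sum(case when last_value is null then 0 else 1 end) over (order by per_second)
value_partition = list(accumulate(0 if v is None else 1 for v in last_values))

print(value_partition)  # [0, 1, 1, 1, 1, 2]
```

The gap rows after 15.82 all carry partition 1, which is what lets the next step copy 15.82 into them.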

Filling in the null values

    select
        per_second,
        first_value(last_value) over (partition by value_partition order by per_second) as latest_value
    from
        (
            select
                per_second,
                last_value,
                sum(case when last_value is null then 0 else 1 end) over (order by per_second) as value_partition
            from
                (
                    select
                        time_bucket('1 second', time + '1 second') as interval,
                        last(val, db_id) as last_value
                    from table
                    where time > <date_start> and time < <date_end>
                    group by interval
                ) a
            right join
                (
                    select
                        time_bucket_gapfill('1 second', time) as per_second
                    from table
                    where time > <date_start> and time < <date_end>
                    group by per_second
                ) b
            on a.interval = b.per_second
        ) as q

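The last step brings everything together. It uses the value_partition column to overwrite the null values: within each partition the first row is the one that holds the non-null value, so first_value copies it down to the null rows that follow.

For example: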

per_second              latest_value
2020-01-01 00:00:00.000 NULL        
2020-01-01 00:00:01.000 15.82       
2020-01-01 00:00:02.000 15.82       
2020-01-01 00:00:03.000 15.82       
2020-01-01 00:00:04.000 15.82       
2020-01-01 00:00:05.000 15.82       
2020-01-01 00:00:06.000 15.82       
2020-01-01 00:00:07.000 15.82       
2020-01-01 00:00:08.000 15.82       
2020-01-01 00:00:09.000 15.82       
2020-01-01 00:00:10.000 15.72       
2020-01-01 00:00:10.000 14.67
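The partition-then-first_value trick amounts to a forward fill. The sketch below reproduces it in plain Python on a made-up list of values; fill_forward is a hypothetical helper written for illustration, with None standing for a SQL NULL.

```python
from itertools import accumulate, groupby


def fill_forward(last_values):
    """Fill None entries with the most recent non-None value, using the
    same partition-then-first_value approach as the query."""
    # Step 1: running count of non-null values (the value_partition column).
    partitions = accumulate(0 if v is None else 1 for v in last_values)
    filled = []
    # Step 2: within each partition, repeat its first value on every row.
    for _, group in groupby(zip(partitions, last_values), key=lambda pair: pair[0]):
        group = list(group)
        first = group[0][1]  # first_value(last_value) within the partition
        filled.extend(first for _ in group)
    return filled


print(fill_forward([None, 15.82, None, None, None, 15.72]))
# [None, 15.82, 15.82, 15.82, 15.82, 15.72]
```

The leading None stays None because partition 0 contains no value to copy, which matches the NULL in the first row of the example output.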

postgresql

etl

time-series

data-science

timescaledb