Time series database with time-weighted-average aggregation function for irregular time series?

Question

Our sensors produce values at irregular time intervals:

12:00 10
12:02 20
12:22 30
12:29 40

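I am trying to find a time series database that can automatically calculate the average over a given fixed time interval (e.g. 10 minutes). Of course, the longer a value is valid within the interval, the more weight it should carry in the average (a time-weighted average). For example, for 12:00-12:10: (10*2+20*8)/10=18.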
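As a sanity check on the arithmetic above, here is a minimal sketch in Python. It assumes each value holds until the next sample arrives (step interpolation) and that the window starts at or after the first sample. The function name `time_weighted_average` and the minutes-since-midnight encoding are illustrative choices, not part of any database API.

```python
def time_weighted_average(samples, start, end):
    """Time-weighted average of a step function over [start, end).

    samples: list of (time, value) sorted by time; each value is assumed
    to hold until the next sample. Times are plain numbers (here: minutes
    since midnight).
    """
    total = 0.0
    for i, (t, v) in enumerate(samples):
        # The value is valid from t until the next sample (the last one holds indefinitely).
        t_next = samples[i + 1][0] if i + 1 < len(samples) else float("inf")
        # Clip the validity interval to the requested window.
        lo, hi = max(t, start), min(t_next, end)
        if hi > lo:
            total += v * (hi - lo)
    return total / (end - start)


# Sensor readings from the question: 12:00 -> 10, 12:02 -> 20, 12:22 -> 30, 12:29 -> 40
readings = [(720, 10), (722, 20), (742, 30), (749, 40)]
print(time_weighted_average(readings, 720, 730))  # 12:00-12:10 -> 18.0
```

The same call with the window 12:20-12:30 weights three different values (20 for 2 minutes, 30 for 7 minutes, 40 for 1 minute).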

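I have been searching the internet for a few hours now and found many time series databases that discuss irregular time series (e.g. InfluxDB, OpenTSDB, etc.), and most of them have some SQL-like query language with aggregation functions.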

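Unfortunately, they don't say exactly how values at irregular time intervals are averaged. Since I don't want to try them all, can someone tell me which databases support calculating a time-weighted average? Thanks!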

Answer 1

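OpenTSDB performs aggregation across all series in a query at the timestamps implied by the query. For any series that has no data value at a given timestamp, it linearly interpolates a value from the values before and after. It does this "upsampling" at query time; the original data is always stored as it was when it arrived. You can perform a trailing windowed time average, but not an exponentially weighted moving average (I believe that is what you meant by time-weighted?)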
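To illustrate what linear interpolation between neighbouring points means here, a minimal Python sketch follows. It is not OpenTSDB code; the function name `interpolate_linear` and the sample points are made up for the example.

```python
def interpolate_linear(t, before, after):
    """Linearly interpolate a value at time t between two (time, value) points."""
    (t0, v0), (t1, v1) = before, after
    if t1 == t0:
        return v0
    # Fraction of the way from the earlier point to the later point.
    fraction = (t - t0) / (t1 - t0)
    return v0 + fraction * (v1 - v0)


# A series with points at t=0 (value 10) and t=10 (value 20) has no sample at t=5.
print(interpolate_linear(5, (0, 10.0), (10, 20.0)))  # 15.0
```

Note that this differs from the step-hold behaviour the question describes, where a value stays valid until the next sample arrives.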

http://opentsdb.net/docs/build/html/user_guide/query/aggregators.html

(I should add that this is not a blanket recommendation of OpenTSDB as the database you should use; I am just answering your question.)

Answer 2

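Axibase Time Series Database supports the weighted time average aggregator (wtavg): http://axibase.com/products/axibase-time-series-database/visualization/widgets/configuring-the-widgets/aggregators/

The effect of wtavg is that older samples are weighted at a rate that decreases linearly compared to the current time.

This aggregator is supported in the REST API, the SQL layer, and the rule engine.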

EDIT 2016-06-15T12:52:00Z: Supported interpolation functions:

LINEAR
PREVIOUS
NEXT
VALUE(v)
NONE

Disclosure: I work for Axibase.

Answer 3

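I recently had to provide a weighted-average solution for irregular samples for our own SCADA/IoT product, which stores its data in PostgreSQL. If you'd like to roll your own, here is how you can do it.

Let's assume the following table: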

create table samples (
  stamp  timestamptz,
  series integer,
  value  float
);

insert into samples values
  ('2018-04-30 23:00:00+02', 1, 12.3),
  ('2018-05-01 01:45:00+02', 1, 22.2),
  ('2018-05-01 02:13:00+02', 1, 21.6),
  ('2018-05-01 02:26:00+02', 1, 14.9),
  ('2018-05-01 03:02:00+02', 1, 16.9);

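To calculate a regular weighted average, we need to do the following:

"Partition" the irregular samples into regular periods
Determine how long each sample was held (its duration)
Calculate a weight for each sample (its duration divided by the period)
Sum value times weight for each period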
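The steps above can be written as one formula. The notation is introduced here for illustration only: v_i is the value of sample i, d_i the time it was held within period p, and P the length of the period.

```latex
\bar{v}_p = \sum_{i \in p} v_i \, w_i ,
\qquad w_i = \frac{d_i}{P} ,
\qquad \sum_{i \in p} w_i = 1
```

For example, if 12.3 is held for 45 minutes of an hour and 22.2 for the remaining 15 minutes, the weights are 0.75 and 0.25, and the average is 12.3 * 0.75 + 22.2 * 0.25 = 14.775.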

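Before presenting the code, we'll make the following assumptions:

The weighted average is calculated for a given time range.
We do not need to deal with null values, which would make the solution slightly more complicated (namely when calculating the weights).
The code is written for PostgreSQL using two techniques: common table expressions and window functions. If you use another database, you may need to write it differently.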

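1. Converting irregular samples into regular periods

Suppose we are interested in calculating the hourly weighted average for series 1 over the period between 2018-05-01 00:00:00+02 and 2018-05-01 04:00:00+02. We'll start by querying the given time range, adding an aligned stamp: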

select
  stamp,
  to_timestamp(extract (epoch from stamp)::integer / 3600 * 3600)
    as stamp_aligned,
  value
from samples
where
  series = 1 and
  stamp >= '2018-05-01 00:00:00+02' and
  stamp <= '2018-05-01 04:00:00+02';

This gives us:

         stamp          |     stamp_aligned      | value 
------------------------+------------------------+-------
 2018-05-01 01:45:00+02 | 2018-05-01 01:00:00+02 |  22.2
 2018-05-01 02:13:00+02 | 2018-05-01 02:00:00+02 |  21.6
 2018-05-01 02:26:00+02 | 2018-05-01 02:00:00+02 |  14.9
 2018-05-01 03:02:00+02 | 2018-05-01 03:00:00+02 |  16.9
(4 rows)
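The alignment trick in the query (integer-divide the epoch by the period, then multiply back) can be tried outside the database. The following Python sketch mirrors it; the helper name `align` is an assumption for illustration. Because the flooring happens on the UTC epoch, it lines up with local hours only for whole-hour UTC offsets such as +02.

```python
from datetime import datetime, timedelta, timezone


def align(stamp, period_seconds=3600):
    """Floor a timezone-aware datetime to the start of its period."""
    epoch = int(stamp.timestamp())
    # Integer division drops the remainder, like `epoch / 3600 * 3600` on SQL integers.
    aligned = epoch // period_seconds * period_seconds
    return datetime.fromtimestamp(aligned, tz=stamp.tzinfo)


tz = timezone(timedelta(hours=2))
print(align(datetime(2018, 5, 1, 2, 13, tzinfo=tz)))  # 2018-05-01 02:00:00+02:00
```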

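We'll notice that:

From the results we cannot tell the value at 00:00:00, nor the value at 01:00:00.
The stamp_aligned column tells us which time period a record belongs to, but the table actually lacks the value at the beginning of each period.

To solve these issues, we'll query the last known value before the given time range, and add records for the round hours, which we'll later fill in with the correct values: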

with
t_values as (
  select * from (
    -- select last value prior to time range
    (select
      stamp,
      to_timestamp(extract(epoch from stamp)::integer / 3600 * 3600)
        as stamp_aligned,
      value,
      false as filled_in
    from samples
    where
      series = 1 and
      stamp <  '2018-05-01 00:00:00+02'
    order by
      stamp desc
    limit 1) union

    -- select records from given time range
    (select 
      stamp,
      to_timestamp(extract(epoch from stamp)::integer / 3600 * 3600)
        as stamp_aligned,
      value,
      false as filled_in
    from samples
    where
      series = 1 and
      stamp >= '2018-05-01 00:00:00+02' and
      stamp <= '2018-05-01 04:00:00+02'
    order by
      stamp) union

    -- select all regular periods for given time range
    (select
      stamp,
      stamp as stamp_aligned,
      null as value,
      true as filled_in
    from generate_series(
      '2018-05-01 00:00:00+02',
      '2018-05-01 04:00:00+02',
      interval '3600 seconds'
    ) stamp)
  ) states
  order by stamp
)
select * from t_values;

This gives us:

         stamp          |     stamp_aligned      | value | filled_in 
------------------------+------------------------+-------+-----------
 2018-04-30 23:00:00+02 | 2018-04-30 23:00:00+02 |  12.3 | f
 2018-05-01 00:00:00+02 | 2018-05-01 00:00:00+02 |     ¤ | t
 2018-05-01 01:00:00+02 | 2018-05-01 01:00:00+02 |     ¤ | t
 2018-05-01 01:45:00+02 | 2018-05-01 01:00:00+02 |  22.2 | f
 2018-05-01 02:00:00+02 | 2018-05-01 02:00:00+02 |     ¤ | t
 2018-05-01 02:13:00+02 | 2018-05-01 02:00:00+02 |  21.6 | f
 2018-05-01 02:26:00+02 | 2018-05-01 02:00:00+02 |  14.9 | f
 2018-05-01 03:00:00+02 | 2018-05-01 03:00:00+02 |     ¤ | t
 2018-05-01 03:02:00+02 | 2018-05-01 03:00:00+02 |  16.9 | f
 2018-05-01 04:00:00+02 | 2018-05-01 04:00:00+02 |     ¤ | t
(10 rows)

So we have at least one record for each time period, but we still need to fill in values for the filled-in records:

with
t_values as (
  ...
),
-- since records generated using generate_series do not contain values,
-- we need to copy the value from the last non-generated record.
t_with_filled_in_values as (
  -- the outer query serves to remove any record prior to the given 
  -- time range
  select *
  from (
    select 
      stamp,
      stamp_aligned,
      -- fill in value from last non-filled record (the first record 
      -- having the same filled_in_partition value)
      (case when filled_in then
        first_value(value) over (partition by filled_in_partition
        order by stamp) else value end) as value
    from (
      select
        stamp, 
        stamp_aligned, 
        value,
        filled_in,
        -- this field is incremented on every non-filled record
        sum(case when filled_in then 0 else 1 end) 
          over (order by stamp) as filled_in_partition
      from 
        t_values
    ) t_filled_in_partition
  ) t_filled_in_values
  -- we wrap the filling-in query in order to remove any record before the
  -- beginning of the given time range
  where stamp >= '2018-05-01 00:00:00+02'
  order by stamp
)
select * from t_with_filled_in_values;

This gives us the following:

         stamp          |     stamp_aligned      | value 
------------------------+------------------------+-------
 2018-05-01 00:00:00+02 | 2018-05-01 00:00:00+02 |  12.3
 2018-05-01 01:00:00+02 | 2018-05-01 01:00:00+02 |  12.3
 2018-05-01 01:45:00+02 | 2018-05-01 01:00:00+02 |  22.2
 2018-05-01 02:00:00+02 | 2018-05-01 02:00:00+02 |  22.2
 2018-05-01 02:13:00+02 | 2018-05-01 02:00:00+02 |  21.6
 2018-05-01 02:26:00+02 | 2018-05-01 02:00:00+02 |  14.9
 2018-05-01 03:00:00+02 | 2018-05-01 03:00:00+02 |  14.9
 2018-05-01 03:02:00+02 | 2018-05-01 03:00:00+02 |  16.9
 2018-05-01 04:00:00+02 | 2018-05-01 04:00:00+02 |  16.9
(9 rows)

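So we're all good: we have added records with the correct values for all the round hours, and we have also removed the first record, which gave us the value for the beginning of the time range but lay outside it. Now we are ready for the next step.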
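The fill-in step relies on a running counter (filled_in_partition in the query) that increases on every real sample, so each generated record lands in the same group as the last real sample before it. Here is a small Python sketch of the same idea, using a shortened, hand-written copy of the rows above; the variable names are illustrative.

```python
from itertools import accumulate

# (stamp, value, filled_in) rows in chronological order; None marks a generated row.
rows = [
    ("23:00", 12.3, False),
    ("00:00", None, True),
    ("01:00", None, True),
    ("01:45", 22.2, False),
    ("02:00", None, True),
]

# Running count of real samples: every generated row inherits the counter of
# the last real sample before it, so rows sharing a counter form one group.
partitions = list(accumulate(0 if is_filled else 1 for _, _, is_filled in rows))

# The first row of each group is the real sample; remember its value.
first_value = {}
for (_, value, _), part in zip(rows, partitions):
    first_value.setdefault(part, value)

# Generated rows take the value of their group's first row.
filled_rows = [
    (stamp, first_value[part] if is_filled else value)
    for (stamp, value, is_filled), part in zip(rows, partitions)
]
print(filled_rows)
```

Here `partitions` comes out as [1, 1, 1, 2, 2], so the 00:00 and 01:00 rows pick up 12.3 and the 02:00 row picks up 22.2.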

2. Calculating the weighted average

We'll continue by calculating the duration of each record:

with
t_values as (
  ...
),
t_with_filled_in_values as (
  ...
),
t_with_weight as (
  select
    stamp,
    stamp_aligned,
    value,
    -- use window to get stamp from next record in order to calculate 
    -- the duration of the record which, divided by the period, gives 
    -- us the weight.
    coalesce(extract(epoch from (lead(stamp)
      over (order by stamp) - stamp)), 3600)::float / 3600 as weight
  from t_with_filled_in_values
  order by stamp
)
select * from t_with_weight;

This gives us:

         stamp          |     stamp_aligned      | value |       weight       
------------------------+------------------------+-------+--------------------
 2018-05-01 00:00:00+02 | 2018-05-01 00:00:00+02 |  12.3 |                  1
 2018-05-01 01:00:00+02 | 2018-05-01 01:00:00+02 |  12.3 |               0.75
 2018-05-01 01:45:00+02 | 2018-05-01 01:00:00+02 |  22.2 |               0.25
 2018-05-01 02:00:00+02 | 2018-05-01 02:00:00+02 |  22.2 |  0.216666666666667
 2018-05-01 02:13:00+02 | 2018-05-01 02:00:00+02 |  21.6 |  0.216666666666667
 2018-05-01 02:26:00+02 | 2018-05-01 02:00:00+02 |  14.9 |  0.566666666666667
 2018-05-01 03:00:00+02 | 2018-05-01 03:00:00+02 |  14.9 | 0.0333333333333333
 2018-05-01 03:02:00+02 | 2018-05-01 03:00:00+02 |  16.9 |  0.966666666666667
 2018-05-01 04:00:00+02 | 2018-05-01 04:00:00+02 |  16.9 |                  1
(9 rows)
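The weight column comes from looking ahead to the next record (lead(stamp) in the query) and dividing the gap by the period. Below is a rough Python equivalent, with stamps given as plain seconds; the function name `weights` is made up for this sketch.

```python
def weights(stamps, period=3600):
    """Weight of each record: seconds until the next record, divided by the period."""
    result = []
    for i, stamp in enumerate(stamps):
        if i + 1 < len(stamps):
            duration = stamps[i + 1] - stamp
        else:
            # No next record: fall back to a full period, like coalesce(..., 3600).
            duration = period
        result.append(duration / period)
    return result


# Seconds since 02:00 for the records at 02:00, 02:13, 02:26 and 03:00.
print(weights([0, 780, 1560, 3600]))
```

The first three weights belong to the 02:00 period and add up to 1. The last record has no successor and gets the fallback weight of 1, which is also why the 04:00 row in the table above has weight 1 even though the queried range ends there.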

All that's left is to sum it up:

with
t_values as (
  ...
),
t_with_filled_in_values as (
  ...
),
t_with_weight as (
  ...
)
select
  stamp_aligned as stamp,
  sum(value * weight) as avg
from t_with_weight
group by stamp_aligned
order by stamp_aligned;

The result:

         stamp          |       avg        
------------------------+------------------
 2018-05-01 00:00:00+02 |             12.3
 2018-05-01 01:00:00+02 |           14.775
 2018-05-01 02:00:00+02 | 17.9333333333333
 2018-05-01 03:00:00+02 | 16.8333333333333
 2018-05-01 04:00:00+02 |             16.9
(5 rows)

You can find the complete code in this gist.

Answer 4

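A time-weighted average (TWA) can be calculated if the TSDB provides an integral function over the values in a given time range. The TWA is then the integral over the given duration divided by that duration. For example, the following query calculates the time-weighted average of the metric power over the last hour in VictoriaMetrics: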

integrate(power[1h])/1h

See the MetricsQL docs for more details on the integrate() function.
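To see why dividing an integral by the duration yields a time-weighted average, here is a small Python sketch. It assumes a simple rectangle rule in which each value holds until the next sample; this is an illustration, not VictoriaMetrics' implementation of integrate(), and the power readings are hypothetical.

```python
def integrate(samples):
    """Rectangle-rule integral over consecutive (seconds, value) samples."""
    return sum(
        value * (t_next - t)
        for (t, value), (t_next, _) in zip(samples, samples[1:])
    )


# Hypothetical power readings in watts over one hour (t in seconds).
power = [(0, 100.0), (600, 400.0), (3000, 100.0), (3600, 100.0)]

energy = integrate(power)  # watt-seconds
print(energy / 3600)       # time-weighted average in watts -> 300.0
```

The plain mean of the four readings would be 175, whereas the time-weighted average is 300, because the 400 W reading was held for 40 of the 60 minutes.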
