How to fill irregularly missing values with linear interpolation in BigQuery?
I have data with irregularly missing values, and I would like to interpolate it using BigQuery Standard SQL.
Specifically, I have data like this:
# data is missing irregularly
+------+-------+
| time | value |
+------+-------+
| 1 | 3.0 |
| 5 | 5.0 |
| 7 | 1.0 |
| 9 | 8.0 |
| 10 | 4.0 |
+------+-------+
I would like to transform this table as follows:
# interpolated with interval of 1
+------+--------------------+
| time | value_interpolated |
+------+--------------------+
| 1 | 3.0 |
| 2 | 3.5 |
| 3 | 4.0 |
| 4 | 4.5 |
| 5 | 5.0 |
| 6 | 3.0 |
| 7 | 1.0 |
| 8 | 4.5 |
| 9 | 8.0 |
| 10 | 4.0 |
+------+--------------------+
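Is there a clever solution?
Addendum: this question is similar to another question, except that here the data is missing irregularly.
Thank you.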
Here is an example of how to solve this in PostgreSQL:
https://dbfiddle.uk/?rdbms=postgres_9.5&fiddle=c560dd9a8db095920d0a15834b6768f1
with data as (
  select time
        ,lead(time) over (order by time) as next_time
        ,value
        ,lead(value) over (order by time) as next_value
        ,(lead(value) over (order by time) - value) as val_diff
        ,(lead(time) over (order by time) - time) as time_diff
  from t
)
select *
      ,generate_series - time as grp
      ,case when generate_series - time = 0 then value
            else value + (val_diff * 1.0 / time_diff) * (generate_series - time) * 1.0
       end as val_grp
from data
cross join generate_series(time, coalesce(next_time - 1, time))
+------+-----------------+-----+-------------------------+
| time | generate_series | grp | val_grp |
+------+-----------------+-----+-------------------------+
| 1 | 1 | 0 | 3.0 |
| 1 | 2 | 1 | 3.500000000000000000000 |
| 1 | 3 | 2 | 4.000000000000000000000 |
| 1 | 4 | 3 | 4.500000000000000000000 |
| 5 | 5 | 0 | 5.0 |
| 5 | 6 | 1 | 3.00000000000000000 |
| 7 | 7 | 0 | 1.0 |
| 7 | 8 | 1 | 4.50000000000000000 |
| 9 | 9 | 0 | 8.0 |
| 10 | 10 | 0 | 4.0 |
+------+-----------------+-----+-------------------------+
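Each val_grp in the table above is the usual linear interpolation between a row and the row that follows it. As a worked check, written with the column names from the query (value, val_diff, time_diff, grp) and two rows taken from the sample data:

```latex
\[
\text{val\_grp} = \text{value} + \frac{\text{val\_diff}}{\text{time\_diff}} \cdot \text{grp}
\]
\[
\text{time}=1,\ \text{generate\_series}=2:\quad
3.0 + \frac{5.0 - 3.0}{5 - 1} \cdot 1 = 3.5
\]
\[
\text{time}=7,\ \text{generate\_series}=8:\quad
1.0 + \frac{8.0 - 1.0}{9 - 7} \cdot 1 = 4.5
\]
```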
I believe the syntax in BigQuery will differ, using UNNEST and GENERATE_ARRAY as shown below. You can give it a try.
with data as (
  select time
        ,lead(time) over (order by time) as next_time
        ,value
        ,lead(value) over (order by time) as next_value
        ,(lead(value) over (order by time) - value) as val_diff
        ,(lead(time) over (order by time) - time) as time_diff
  from t
)
select *
      ,generate_series - time as grp
      ,case when generate_series - time = 0 then value
            else value + (val_diff * 1.0 / time_diff) * (generate_series - time) * 1.0
       end as val_grp
from data
cross join UNNEST(GENERATE_ARRAY(time, coalesce(next_time - 1, time))) as generate_series
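If only the two columns from the question are wanted, the query above can be trimmed down. The following is an untested sketch along the same lines, not a definitive version: the inline sample data in the CTE `t` and the alias `tick` are illustrative names of my own, and the output column is named `value_interpolated` to match the question.

```sql
-- Sketch only: `t` is inline sample data copied from the question,
-- and `tick` is a hypothetical alias for the generated time points.
with t as (
  select 1 as time, 3.0 as value union all
  select 5, 5.0 union all
  select 7, 1.0 union all
  select 9, 8.0 union all
  select 10, 4.0
),
data as (
  select time,
         value,
         lead(time) over (order by time) as next_time,
         lead(value) over (order by time) as next_value
  from t
)
select tick as time,
       -- The slope term is NULL on the last row (no next row), so ifnull
       -- falls back to 0 and the original value is returned unchanged.
       value + ifnull((next_value - value) / (next_time - time) * (tick - time), 0)
         as value_interpolated
from data
cross join unnest(generate_array(time, coalesce(next_time - 1, time))) as tick
order by 1;
```

Ordering by column position avoids any ambiguity between the output alias `time` and the source column of the same name.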
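In BigQuery, you can use generate_array() to generate the additional rows for each row. Then you can use lead() to get information from the next row, plus some arithmetic for the interpolation: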
with t as (
  select 1 as time, 3.0 as value union all
  select 5, 5.0 union all
  select 7, 1.0 union all
  select 9, 8.0 union all
  select 10, 4.0
),
tt as (
  select t.*,
         lead(time) over (order by time) as next_time,
         lead(value) over (order by time) as next_value
  from t
)
select coalesce(n, tt.time) as time,
       (case when n = tt.time or n is null then value
             else tt.value + (tt.next_value - tt.value) * (n - tt.time) / (tt.next_time - tt.time)
        end) as value
from tt left join
     unnest(generate_array(tt.time, tt.next_time - 1, 1)) n
     on true
order by 1;
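Note: you have a column named time that contains an integer. If it is really some sort of date/time data type, I suggest you ask a new question with more appropriate sample data and desired results, if you do not see how to adapt this answer.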
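For what it is worth, one possible adaptation to a real timestamp column might look like the sketch below. It is untested and rests on assumptions that are not in the question: a hypothetical TIMESTAMP column named `ts`, samples that fall on whole hours, and a desired output interval of one hour. It swaps generate_array() for generate_timestamp_array() and uses timestamp_diff() for the time arithmetic.

```sql
-- Sketch only: the column `ts`, the sample rows, and the one-hour grid
-- are assumptions made for illustration.
with t as (
  select timestamp '2020-01-01 01:00:00' as ts, 3.0 as value union all
  select timestamp '2020-01-01 05:00:00', 5.0 union all
  select timestamp '2020-01-01 07:00:00', 1.0
),
tt as (
  select t.*,
         lead(ts) over (order by ts) as next_ts,
         lead(value) over (order by ts) as next_value
  from t
)
select coalesce(n, tt.ts) as ts,
       case when n = tt.ts or n is null then tt.value
            -- Weight the value difference by elapsed seconds over total seconds.
            else tt.value + (tt.next_value - tt.value)
                 * timestamp_diff(n, tt.ts, second)
                 / timestamp_diff(tt.next_ts, tt.ts, second)
       end as value
from tt left join
     unnest(generate_timestamp_array(
       tt.ts,
       timestamp_sub(tt.next_ts, interval 1 hour),
       interval 1 hour)) n
     on true
order by 1;
```

As in the integer version, the last row has no next row, so the generated array is NULL, the left join yields a NULL `n`, and the original value is returned as-is.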
Below is for BigQuery Standard SQL:
#standardSQL
select time,
  ifnull(value, start_value + (end_value - start_value) / (end_tick - start_tick) * (time - start_tick)) as value_interpolated
from (
  select time, value,
    first_value(tick ignore nulls) over win1 as start_tick,
    first_value(value ignore nulls) over win1 as start_value,
    first_value(tick ignore nulls) over win2 as end_tick,
    first_value(value ignore nulls) over win2 as end_value,
  from (
    select time, t.time as tick, value
    from (
      select generate_array(min(time), max(time)) times
      from `project.dataset.table`
    ), unnest(times) time
    left join `project.dataset.table` t
    using(time)
  )
  window win1 as (order by time desc rows between current row and unbounded following),
         win2 as (order by time rows between current row and unbounded following)
)
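If applied to the sample data from your question, this yields the interpolated values requested there: 3.5, 4.0, and 4.5 for times 2 to 4, 3.0 for time 6, and 4.5 for time 8, with the original values kept at the other times.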
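One filled-in row can be checked by hand. For time 6 in the sample data, the nearest known point looking backward is (5, 5.0) and the nearest looking forward is (7, 1.0), which is what win1 and win2 pick out as start_tick/start_value and end_tick/end_value:

```latex
\[
\text{value\_interpolated}
= \text{start\_value}
+ \frac{\text{end\_value} - \text{start\_value}}{\text{end\_tick} - \text{start\_tick}}
  \cdot (\text{time} - \text{start\_tick})
\]
\[
\text{time}=6:\quad
5.0 + \frac{1.0 - 5.0}{7 - 5} \cdot (6 - 5) = 3.0
\]
```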