Sum column values over a window based on a week range (Impala)
Given a table like the following:
client_id  date        connections
---------------------------------------
121438297  2018-01-03  0
121438297  2018-01-08  1
121438297  2018-01-10  3
121438297  2018-01-12  1
121438297  2018-01-19  7
363863811  2018-01-18  0
363863811  2018-01-30  5
363863811  2018-02-01  4
363863811  2018-02-10  0
I am looking for an efficient way to sum the connections that occur within 6 days after the current row (the current row is included in the sum), partitioned by client_id. The result would be:
client_id  date        connections  connections_within_6_days
---------------------------------------------------------------------
121438297  2018-01-03  0            1
121438297  2018-01-08  1            5
121438297  2018-01-10  3            4
121438297  2018-01-12  1            1
121438297  2018-01-19  7            7
363863811  2018-01-18  0            0
363863811  2018-01-30  5            9
363863811  2018-02-01  4            4
363863811  2018-02-10  0            0
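Problems:
I don't want to add all the missing dates and then run a sliding window over the following 7 rows, because my table is already very large.
I am using Impala, which does not support range between interval '7' days following and current row.
Edit: I am looking for a generic answer, because I need to be able to change the window size to larger values (30+ days, for example).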
This answers the original version of the question.
Impala does not fully support range between. Unfortunately, that does not leave many options. One is to use lead() with a lot of explicit logic (lead() rather than lag(), because the window looks forward from the current row). Each of the next six rows is added only if its date falls within 6 days of the current row:
    -- Assumes date is a TIMESTAMP (or DATE) column and that each client_id has
    -- at most one row per day, so the next 6 rows cover every row within 6 days.
    -- A missing lead() row yields NULL, which falls through to "else 0".
    select t.*,
           ( connections +
             (case when lead(date, 1) over (partition by client_id order by date) <= date + interval 6 day
                   then lead(connections, 1) over (partition by client_id order by date)
                   else 0
              end) +
             (case when lead(date, 2) over (partition by client_id order by date) <= date + interval 6 day
                   then lead(connections, 2) over (partition by client_id order by date)
                   else 0
              end) +
             (case when lead(date, 3) over (partition by client_id order by date) <= date + interval 6 day
                   then lead(connections, 3) over (partition by client_id order by date)
                   else 0
              end) +
             (case when lead(date, 4) over (partition by client_id order by date) <= date + interval 6 day
                   then lead(connections, 4) over (partition by client_id order by date)
                   else 0
              end) +
             (case when lead(date, 5) over (partition by client_id order by date) <= date + interval 6 day
                   then lead(connections, 5) over (partition by client_id order by date)
                   else 0
              end) +
             (case when lead(date, 6) over (partition by client_id order by date) <= date + interval 6 day
                   then lead(connections, 6) over (partition by client_id order by date)
                   else 0
              end)
           ) as connections_within_6_days
    from t;
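Unfortunately, this does not generalize well: every extra day in the window needs another case expression. If you need a large range of days, you may want to ask another question.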
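For larger windows, one possible alternative (not part of the answer above) is a self-join on a date range, which takes the window size as a single parameter. The sketch below is only an illustration under stated assumptions: it runs the query in SQLite through Python's sqlite3 module so it can be checked against the sample data, the table name t and the helper window_sums are hypothetical, dates are stored as 'YYYY-MM-DD' text, and each client_id has at most one row per date. In Impala the upper bound would presumably be written with Impala's own date arithmetic (for example a.date + interval 6 day) instead of SQLite's date() function.

```python
import sqlite3

# Sample rows from the question: (client_id, date, connections).
ROWS = [
    (121438297, "2018-01-03", 0),
    (121438297, "2018-01-08", 1),
    (121438297, "2018-01-10", 3),
    (121438297, "2018-01-12", 1),
    (121438297, "2018-01-19", 7),
    (363863811, "2018-01-18", 0),
    (363863811, "2018-01-30", 5),
    (363863811, "2018-02-01", 4),
    (363863811, "2018-02-10", 0),
]

# Self-join: pair each row `a` with every row `b` of the same client whose
# date lies in [a.date, a.date + :days], then sum b.connections per `a` row.
# SQLite syntax; ISO-formatted date strings compare correctly as text.
QUERY = """
SELECT a.client_id, a.date, a.connections,
       SUM(b.connections) AS connections_within_n_days
FROM t AS a
JOIN t AS b
  ON b.client_id = a.client_id
 AND b.date >= a.date
 AND b.date <= date(a.date, '+' || :days || ' days')
GROUP BY a.client_id, a.date, a.connections
ORDER BY a.client_id, a.date
"""


def window_sums(rows, days):
    """Load rows into an in-memory table and return the forward window sums."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (client_id INTEGER, date TEXT, connections INTEGER)"
    )
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY, {"days": days}).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in window_sums(ROWS, 6):
        print(row)
```

With days set to 6 this reproduces the connections_within_6_days column from the question, and changing the parameter to 30 needs no change to the query. The trade-off is that a range self-join can be expensive on a very large table, so it is worth checking the query plan before relying on it.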