How to get the first element in groups over a window in SQL?
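I want to get the first element of each group, but the groups have to be computed for each window. I would like to do something like this:
Schema
TABLE |
---|
id |
target |
groups |
capture_date |
event_date |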
SELECT
AVG(
FIRST(target) GROUP BY id ORDER BY capture_date DESC WHERE capture_date <= MAX(event_date)
) OVER (
PARTITION BY groups
ORDER BY event_date
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
)
FROM table
I would like to do this in SQL or PySpark, whichever is easier. Any ideas? Thanks!
Here is a complete SQL version; it could be rewritten in Spark/PySpark if needed. I used a GROUP BY, but you could also run a second window with row_number and a WHERE filter.
with raw_averages as ( -- CTE: shortcut for a subquery
    SELECT
        AVG(target) OVER (
            PARTITION BY groups
            ORDER BY event_date
            RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
        ) as average,
        groups, -- carried through because the GROUP BY below needs it
        id,
        capture_date,
        event_date
    FROM table
),
grouped_result as ( -- another CTE
    SELECT
        groups,
        id,
        avg(average) as average, -- the average of the entire group is the same --> math trick
        reverse( -- sort descending
            array_sort( -- sort ascending by first item (event_date)
                arrays_zip( -- create one array of the arrays below
                    collect_list(event_date), -- collect the grouped items *has to be first to get the ordering you want*
                    collect_list(capture_date) -- collect the grouped items
                )
            )
        )[0] as values -- taking the first element returns the max (first item in the reversed array)
    from raw_averages
    GROUP BY groups, id
    -- HAVING values.`0` = values.`1` -- having might work here but I didn't explore it
)
select
    groups,
    id,
    average,
    values.`0` as event_date, -- awkward syntax because of arrays_zip
    values.`1` as capture_date
from grouped_result
where values.`0` = values.`1`
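The row_number alternative mentioned above can be sketched as follows. This is only a minimal illustration, not the answer's method: it uses Python's built-in sqlite3 so it runs without a Spark cluster, the table name `events` and the sample rows are made up, and the filter `capture_date <= event_date` is a row-level simplification of the per-window condition in the question. It shows only the "first row per id, latest capture first" step, not the 7-day rolling average.

```python
import sqlite3

# Hypothetical sample rows: (id, target, groups, capture_date, event_date)
rows = [
    (1, 10.0, "a", "2023-01-01", "2023-01-05"),
    (1, 20.0, "a", "2023-01-03", "2023-01-05"),
    (2, 30.0, "a", "2023-01-02", "2023-01-06"),
    (3, 40.0, "b", "2023-01-04", "2023-01-06"),
    (3, 50.0, "b", "2023-01-02", "2023-01-06"),
]

conn = sqlite3.connect(":memory:")
# "groups" is quoted because it is a keyword in recent SQLite versions.
conn.execute(
    """
    CREATE TABLE events (
        id INTEGER, target REAL, "groups" TEXT,
        capture_date TEXT, event_date TEXT
    )
    """
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)

query = """
WITH ranked AS (
    SELECT
        id, target, "groups", capture_date,
        -- number the rows of each (groups, id) pair, latest capture first
        ROW_NUMBER() OVER (
            PARTITION BY "groups", id
            ORDER BY capture_date DESC
        ) AS rn
    FROM events
    WHERE capture_date <= event_date  -- simplified row-level condition
)
SELECT "groups", id, target, capture_date
FROM ranked
WHERE rn = 1  -- keep only the first row of each pair
ORDER BY "groups", id
"""
latest = conn.execute(query).fetchall()
print(latest)
```

The same `ROW_NUMBER() ... WHERE rn = 1` pattern should carry over to Spark SQL, where the result could then feed the windowed `AVG` from the query above.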