(REDSHIFT) Vertical Coalesce / FIRST_VALUE() as an Aggregate
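(This is specific to Redshift, so answers should take its columnar nature, sort order, etc. into account.)
I need to get the first non-NULL value from each column, per category, when the rows are ordered by a timestamp.
Essentially, the same as FIRST_VALUE(), but as an aggregate.
Or, COALESCE() as an aggregate.
Redshift doesn't have the niceties of newer versions of PostgreSQL or Oracle, though. So I'm looking for options to test against my 100-million-row import :)
(I don't like either of my own options, but I'm struggling to find better ones.)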
Example input
 category | row_timestamp | value_a | value_b | value_c
----------+---------------+---------+---------+---------
 01       | 001           | NULL    | NULL    | 4
 01       | 010           | 7       | NULL    | NULL
 01       | 100           | NULL    | 1       | 2
 01       | 999           | 6       | 3       | 6
 02       | 001           | 1       | NULL    | NULL
 02       | 010           | NULL    | 2       | NULL
 02       | 100           | NULL    | 1       | 9
 02       | 999           | 6       | 3       | 2
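For anyone who wants to reproduce this, the sample input could be loaded roughly as follows. This is only a sketch: the column types are my assumption (the post doesn't state them), and row_timestamp is stored as a plain integer, so the leading zeros shown above are dropped.

```sql
-- Assumed types; the original post does not specify them.
CREATE TABLE mytable (
    category      CHAR(2),
    row_timestamp INTEGER,
    value_a       INTEGER,
    value_b       INTEGER,
    value_c       INTEGER
);

-- Same rows as the sample input table.
INSERT INTO mytable (category, row_timestamp, value_a, value_b, value_c) VALUES
    ('01',   1, NULL, NULL,    4),
    ('01',  10,    7, NULL, NULL),
    ('01', 100, NULL,    1,    2),
    ('01', 999,    6,    3,    6),
    ('02',   1,    1, NULL, NULL),
    ('02',  10, NULL,    2, NULL),
    ('02', 100, NULL,    1,    9),
    ('02', 999,    6,    3,    2);
```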
Expected results
 category | value_a | value_b | value_c
----------+---------+---------+---------
 01       | 7       | 1       | 4
 02       | 1       | 2       | 9
Current solution
SELECT DISTINCT
    category,
    FIRST_VALUE(value_a IGNORE NULLS)
        OVER (PARTITION BY category
                  ORDER BY row_timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
             )
        AS value_a,
    FIRST_VALUE(value_b IGNORE NULLS)
        OVER (PARTITION BY category
                  ORDER BY row_timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
             )
        AS value_b,
    FIRST_VALUE(value_c IGNORE NULLS)
        OVER (PARTITION BY category
                  ORDER BY row_timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
             )
        AS value_c
FROM
    mytable
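It works, but the window functions emit one row per input row, so the DISTINCT may have to collapse hundreds or thousands of identical rows per category. Less than ideal.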
If it were only one or two columns, this might be workable (but with a dozen columns it's horrible)...
WITH
    sorted_value_a AS
(
    SELECT
        category,
        value_a,
        -- Non-NULL values sort first (FALSE < TRUE), then by earliest timestamp
        ROW_NUMBER() OVER (PARTITION BY category
                               ORDER BY value_a IS NULL, row_timestamp
                          )
            AS row_ordinal
    FROM
        mytable
),
    sorted_value_b AS
(
    SELECT
        category,
        value_b,
        ROW_NUMBER() OVER (PARTITION BY category
                               ORDER BY value_b IS NULL, row_timestamp
                          )
            AS row_ordinal
    FROM
        mytable
),
    sorted_value_c AS
(
    SELECT
        category,
        value_c,
        ROW_NUMBER() OVER (PARTITION BY category
                               ORDER BY value_c IS NULL, row_timestamp
                          )
            AS row_ordinal
    FROM
        mytable
)
SELECT
    a.category,
    a.value_a,
    b.value_b,
    c.value_c
FROM
    sorted_value_a   AS a
INNER JOIN
    sorted_value_b   AS b
        ON  b.category    = a.category
        AND b.row_ordinal = 1
INNER JOIN
    sorted_value_c   AS c
        ON  c.category    = a.category
        AND c.row_ordinal = 1
WHERE
    -- Keep only the top-ranked row from each CTE: one output row per category
    a.row_ordinal = 1
Well, I don't know if this is elegant, but you can do it like this:
select category, value_a, value_b, value_c
from (select category,
             -- this row's value if present, otherwise the next non-NULL value later in the category
             coalesce(value_a, lead(value_a) ignore nulls over (partition by category order by row_timestamp)) as value_a,
             coalesce(value_b, lead(value_b) ignore nulls over (partition by category order by row_timestamp)) as value_b,
             coalesce(value_c, lead(value_c) ignore nulls over (partition by category order by row_timestamp)) as value_c,
             -- seqnum = 1 marks the earliest row of each category
             row_number() over (partition by category order by row_timestamp) as seqnum
      from mytable t
     ) t
where seqnum = 1;
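Another option that might be worth benchmarking is to turn the problem into plain aggregates. This is only a sketch, not something from the post above: for each column it finds the earliest timestamp with a non-NULL value, then picks the value on that row with a conditional MAX() and a GROUP BY instead of DISTINCT. The first_ts_* names are made up for illustration, and it assumes row_timestamp is unique within a category (with ties, MAX() would pick the largest of the tied values).

```sql
-- Sketch only: assumes row_timestamp is unique within each category.
SELECT
    category,
    -- Pick the value sitting on the row whose timestamp is the column's first non-NULL timestamp
    MAX(CASE WHEN row_timestamp = first_ts_a THEN value_a END) AS value_a,
    MAX(CASE WHEN row_timestamp = first_ts_b THEN value_b END) AS value_b,
    MAX(CASE WHEN row_timestamp = first_ts_c THEN value_c END) AS value_c
FROM (
    SELECT
        category,
        row_timestamp,
        value_a,
        value_b,
        value_c,
        -- Earliest timestamp per category at which each column is non-NULL
        MIN(CASE WHEN value_a IS NOT NULL THEN row_timestamp END)
            OVER (PARTITION BY category) AS first_ts_a,
        MIN(CASE WHEN value_b IS NOT NULL THEN row_timestamp END)
            OVER (PARTITION BY category) AS first_ts_b,
        MIN(CASE WHEN value_c IS NOT NULL THEN row_timestamp END)
            OVER (PARTITION BY category) AS first_ts_c
    FROM mytable
) AS t
GROUP BY category;
```

Against the sample input this should give the two rows listed under the expected results. A column that is NULL for every row of a category comes back as NULL. Whether it actually beats the DISTINCT version on 100 million rows would have to be measured, since that depends on the table's distribution and sort keys.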