Pivot/Denormalize Over Effective Ranges
I'm looking for a way to convert a transactional dataset into an SCD2-style result that captures the time intervals over which each combination of values was effective at the pivot grain.
Snowflake is the DBMS I'm actually working in, but I've tagged Oracle as well since their dialects are nearly identical. That said, I'd happily crib a solution written for any DBMS.
I have working SQL, but it was born of trial and error, and I feel there must be a more elegant approach I'm missing, because what I have is quite ugly and computationally expensive.
(Note: the second record in the input data "expires" the first. You can assume every date of interest will appear at least once as an add_dts.)
Input:
| Original_Grain | Pivot_Grain | Pivot_Column | Pivot_Attribute | ADD_TS |
|---|---|---|---|---|
| OG-1 | PG-1 | First_Col | A | 2020-01-01 |
| OG-1 | PG-1 | First_Col | B | 2020-01-02 |
| OG-2 | PG-1 | Second_Col | A | 2020-01-01 |
| OG-3 | PG-1 | Third_Col | C | 2020-01-02 |
| OG-3 | PG-1 | Third_Col | B | 2020-01-03 |
Output:
| Pivot_Grain | First_Col | Second_Col | Third_Col | From_Dt | To_Dt |
|---|---|---|---|---|---|
| PG-1 | A | A | NULL | 2020-01-01 | 2020-01-02 |
| PG-1 | B | A | C | 2020-01-02 | 2020-01-03 |
| PG-1 | B | A | B | 2020-01-03 | 9999-01-01 |
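To make the interval semantics concrete: a day D is covered by the row where From_Dt <= D < To_Dt (half-open intervals). As a hypothetical illustration, if the output above were materialized as a table named scd2_output (an assumed name, not part of the query below), an as-of lookup would look like this:

SELECT pivot_grain,
       first_col,
       second_col,
       third_col
  FROM scd2_output  -- hypothetical table holding the output above
 WHERE pivot_grain = 'PG-1'
   AND TO_DATE('2020-01-02','YYYY-MM-DD') >= from_dt
   AND TO_DATE('2020-01-02','YYYY-MM-DD') <  to_dt;
-- returns exactly one row: B, A, C

Here is the working SQL I have today: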
WITH INPUT AS
( SELECT 'OG-1' AS Original_Grain,
'PG-1' AS Pivot_Grain,
'First_Col' AS Pivot_Column,
'A' AS Pivot_Attribute,
TO_DATE('2020-01-01','YYYY-MM-DD') AS Add_Dts
FROM dual
UNION
SELECT 'OG-1' AS Original_Grain,
'PG-1' AS Pivot_Grain,
'First_Col' AS Pivot_Column,
'B' AS Pivot_Attribute,
TO_DATE('2020-01-02','YYYY-MM-DD')
FROM dual
UNION
SELECT 'OG-2' AS Original_Grain,
'PG-1' AS Pivot_Grain,
'Second_Col' AS Pivot_Column,
'A' AS Pivot_Attribute,
TO_DATE('2020-01-01','YYYY-MM-DD')
FROM dual
UNION
SELECT 'OG-3' AS Original_Grain,
'PG-1' AS Pivot_Grain,
'Third_Col' AS Pivot_Column,
'C' AS Pivot_Attribute,
TO_DATE('2020-01-02','YYYY-MM-DD')
FROM dual
UNION
SELECT 'OG-3' AS Original_Grain,
'PG-1' AS Pivot_Grain,
'Third_Col' AS Pivot_Column,
'B' AS Pivot_Attribute,
TO_DATE('2020-01-03','YYYY-MM-DD')
FROM dual
),
GET_NORMALIZED_RANGES AS
( SELECT I.*,
COALESCE(
LEAD(Add_Dts) OVER (
PARTITION BY I.Original_Grain
ORDER BY I.Add_Dts), TO_DATE('9000-01-01')
) AS Next_Add_Dts
FROM INPUT I
),
GET_DISTINCT_ADD_DATES AS
( SELECT DISTINCT Add_Dts AS Driving_Date
FROM Input
),
NORMALIZED_EFFECTIVE_AT_EACH_POINT AS
( SELECT GNR.*,
GDAD.Driving_Date
FROM GET_NORMALIZED_RANGES GNR
INNER
JOIN GET_DISTINCT_ADD_DATES GDAD
ON GDAD.driving_date >= GNR.add_dts
AND GDAD.driving_Date < GNR.next_add_dts
),
PIVOT_EACH_POINT AS
( SELECT DISTINCT
Pivot_Grain,
Driving_Date,
MAX("'First_Col'") OVER ( PARTITION BY Pivot_Grain, Driving_Date) AS First_Col,
MAX("'Second_Col'") OVER ( PARTITION BY Pivot_Grain, Driving_Date) AS Second_Col,
MAX("'Third_Col'") OVER ( PARTITION BY Pivot_Grain, Driving_Date) AS Third_Col
FROM NORMALIZED_EFFECTIVE_AT_EACH_POINT NEP
PIVOT (MAX(Pivot_Attribute) FOR PIVOT_COLUMN IN ('First_Col','Second_Col','Third_Col'))
)
SELECT Pivot_Grain,
Driving_Date AS From_Dt,
COALESCE(LEAD(Driving_Date) OVER ( PARTITION BY pivot_grain ORDER BY Driving_Date),TO_DATE('9999-01-01')) AS To_Dt,
First_Col,
Second_Col,
Third_Col
FROM PIVOT_EACH_POINT
Not sure if this answers your question, but see https://jeffreyjacobs.wordpress.com/2021/03/03/pivoting-iiot-data-in-snowflake/
First, the input can be written with the VALUES operator, with the column names moved into the CTE definition, so it takes up less space:
WITH input(original_grain, pivot_grain, pivot_column, pivot_attribute, add_dts) AS (
SELECT * FROM VALUES
('OG-1', 'PG-1', 'First_Col', 'A', '2020-01-01'::date),
('OG-1', 'PG-1', 'First_Col', 'B', '2020-01-02'::date),
('OG-2', 'PG-1', 'Second_Col', 'A', '2020-01-01'::date),
('OG-3', 'PG-1', 'Third_Col', 'C', '2020-01-02'::date),
('OG-3', 'PG-1', 'Third_Col', 'B', '2020-01-03'::date)
)
The LEAD can be simplified by using its default-value argument, which is an implicit COALESCE. Sometimes, when this type of data has gaps, IGNORE NULLS is also a fantastic tool (a sketch follows the next snippet):
, get_normalized_ranges AS (
SELECT
*
,LEAD(add_dts,1,'9000-01-01'::date) OVER (PARTITION BY original_grain ORDER BY add_dts) AS next_add_dts
FROM input
)
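A minimal sketch of that IGNORE NULLS variant, for the separate case where some rows carry NULL attributes and you want to roll the last known value forward. It is not needed for this particular data set; the CTE and column names here are hypothetical:

, last_known_attribute AS (
    SELECT
        original_grain
        ,add_dts
        -- if the current attribute is NULL, fall back to the most recent
        -- non-NULL attribute earlier in this grain's timeline
        ,COALESCE(
            pivot_attribute,
            LAG(pivot_attribute) IGNORE NULLS OVER (
                PARTITION BY original_grain ORDER BY add_dts)
        ) AS filled_attribute
    FROM input
)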
get_distinct_add_dates looks fine as it is.
, get_distinct_add_dates AS (
SELECT DISTINCT add_dts AS driving_date
FROM input
)
Depending on your data, normalized_effective_at_each_point will live up to its name and give you a value at each point in time/date, but it will also slice across unrelated values. (I'm assuming pivot_grain is the ID of some global thing and that different grains are independent data, so here is input that exercises that:)
('OG-1', 'PG-1', 'First_Col', 'A', '2020-01-01'::date),
('OG-1', 'PG-1', 'First_Col', 'B', '2020-01-03'::date),
('OG-2', 'PG-1', 'Second_Col','A', '2020-01-01'::date),
('OG-3', 'PG-1', 'Third_Col', 'C', '2020-01-03'::date),
('OG-3', 'PG-1', 'Third_Col', 'B', '2020-01-05'::date),
('OG-4', 'PG-2', 'First_Col', 'D', '2020-02-02'::date),
('OG-4', 'PG-2', 'First_Col', 'E', '2020-02-04'::date),
('OG-5', 'PG-2', 'Second_Col','D', '2020-02-02'::date),
('OG-6', 'PG-2', 'Third_Col', 'F', '2020-02-04'::date),
('OG-6', 'PG-2', 'Third_Col', 'D', '2020-02-06'::date)
At which point get_distinct_add_dates should become:
, get_distinct_add_dates AS (
SELECT DISTINCT pivot_grain, add_dts AS driving_date
FROM input
)
An INNER JOIN is just a JOIN, so we can drop the unneeded INNER:
, normalized_effective_at_each_point AS (
SELECT gnr.*,
gdad.driving_date
FROM get_normalized_ranges AS gnr
JOIN get_distinct_add_dates AS gdad
ON gnr.pivot_grain = gdad.pivot_grain
AND gdad.driving_date >= gnr.add_dts
AND gdad.driving_date < gnr.next_add_dts
)
Really, pivot_each_point is a three-way JOIN, or it can be written as a GROUP BY, which is what the DISTINCT was effectively doing for us anyway, so the PIVOT goes away:
, pivot_each_point AS (
SELECT Pivot_Grain
,Driving_Date
,MAX(IFF(pivot_column='First_Col', Pivot_Attribute, NULL)) as first_col
,MAX(IFF(pivot_column='Second_Col', Pivot_Attribute, NULL)) as second_col
,MAX(IFF(pivot_column='Third_Col', Pivot_Attribute, NULL)) as third_col
FROM normalized_effective_at_each_point
GROUP BY 1,2
)
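For comparison, a rough sketch of the three-way JOIN formulation just mentioned. It is shown only to illustrate the equivalence (it assumes at most one row per grain/date/column); the GROUP BY form above is what the final query keeps:

, pivot_each_point AS (
    SELECT gdad.pivot_grain
        ,gdad.driving_date
        ,f.pivot_attribute AS first_col
        ,s.pivot_attribute AS second_col
        ,t.pivot_attribute AS third_col
    FROM get_distinct_add_dates AS gdad
    -- one LEFT JOIN per pivoted column; LEFT so missing columns yield NULL
    LEFT JOIN normalized_effective_at_each_point AS f
        ON f.pivot_grain = gdad.pivot_grain
        AND f.driving_date = gdad.driving_date
        AND f.pivot_column = 'First_Col'
    LEFT JOIN normalized_effective_at_each_point AS s
        ON s.pivot_grain = gdad.pivot_grain
        AND s.driving_date = gdad.driving_date
        AND s.pivot_column = 'Second_Col'
    LEFT JOIN normalized_effective_at_each_point AS t
        ON t.pivot_grain = gdad.pivot_grain
        AND t.driving_date = gdad.driving_date
        AND t.pivot_column = 'Third_Col'
)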
Finally, the last LEAD can drop its COALESCE and be moved into pivot_each_point. Putting it all together:
WITH input(original_grain, pivot_grain, pivot_column, pivot_attribute, add_dts) AS (
SELECT * FROM VALUES
('OG-1', 'PG-1', 'First_Col', 'A', '2020-01-01'::date),
('OG-1', 'PG-1', 'First_Col', 'B', '2020-01-03'::date),
('OG-2', 'PG-1', 'Second_Col','A', '2020-01-01'::date),
('OG-3', 'PG-1', 'Third_Col', 'C', '2020-01-03'::date),
('OG-3', 'PG-1', 'Third_Col', 'B', '2020-01-05'::date),
('OG-4', 'PG-2', 'First_Col', 'D', '2020-02-02'::date),
('OG-4', 'PG-2', 'First_Col', 'E', '2020-02-04'::date),
('OG-5', 'PG-2', 'Second_Col','D', '2020-02-02'::date),
('OG-6', 'PG-2', 'Third_Col', 'F', '2020-02-04'::date),
('OG-6', 'PG-2', 'Third_Col', 'D', '2020-02-06'::date)
), get_normalized_ranges AS (
SELECT
*
,LEAD(add_dts,1,'9000-01-01'::date) OVER (PARTITION BY original_grain ORDER BY add_dts) AS next_add_dts
FROM input
), get_distinct_add_dates AS (
SELECT DISTINCT pivot_grain, add_dts AS driving_date
FROM input
), normalized_effective_at_each_point AS (
SELECT gnr.*,
gdad.driving_date
FROM get_normalized_ranges AS gnr
JOIN get_distinct_add_dates AS gdad
ON gnr.pivot_grain = gdad.pivot_grain
AND gdad.driving_date >= gnr.add_dts
AND gdad.driving_date < gnr.next_add_dts
)
SELECT pivot_grain
,driving_date
,LEAD(driving_date, 1, '9999-01-01'::date) OVER (PARTITION BY pivot_grain ORDER BY driving_date) AS to_dt
,MAX(IFF(pivot_column = 'First_Col', pivot_attribute, NULL)) AS first_col
,MAX(IFF(pivot_column = 'Second_Col', pivot_attribute, NULL)) AS second_col
,MAX(IFF(pivot_column = 'Third_Col', pivot_attribute, NULL)) AS third_col
FROM normalized_effective_at_each_point
GROUP BY pivot_grain, driving_date
ORDER BY pivot_grain, driving_date;
Which gives the results:
| PIVOT_GRAIN | DRIVING_DATE | TO_DT | FIRST_COL | SECOND_COL | THIRD_COL |
|---|---|---|---|---|---|
| PG-1 | 2020-01-01 | 2020-01-03 | A | A | null |
| PG-1 | 2020-01-03 | 2020-01-05 | B | A | C |
| PG-1 | 2020-01-05 | 9999-01-01 | B | A | B |
| PG-2 | 2020-02-02 | 2020-02-04 | D | D | null |
| PG-2 | 2020-02-04 | 2020-02-06 | E | D | F |
| PG-2 | 2020-02-06 | 9999-01-01 | E | D | D |
I can't help thinking I've over-mapped the way I handle my own data onto your PIVOT_GRAIN. Now that I understand the code, and having tried attacking the problem again from first principles, I think the first three processing CTEs are how I would do it, and the GROUP BY is how I would do the rest as well, rather than many JOINs. In general, this explode-the-data-then-merge (or GROUP BY) shape is one I like in Snowflake, because it's all nicely parallelizable.