Build/Rebuild a full history Type-2 Table from source data
I have a table in my PSA (persistent staging area) where I capture changes to records in a source table. Say it looks like this:
+-----+------------+----------------+------------------+------------------+
| PK | Check_cols | Dont_care_cols | start_ts | end_ts |
+-----+------------+----------------+------------------+------------------+
| 123 | abc | def | 1/1/20 00:10:00 | 1/2/20 13:13:23 |
| 123 | abc | dhf | 1/2/20 13:13:23 | 1/3/20 04:21:00 |
| 123 | abc | dhz | 1/3/20 04:21:00 | 1/5/20 12:15:00 |
| 123 | abd | dyz | 1/5/20 12:15:00 | 1/9/20 15:16:00 |
| 123 | abc | dyz | 1/9/20 15:16:00 | null |
| 456 | ghi | jkl | 1/2/20 03:45:00 | 1/10/20 00:00:00 |
| 456 | lmn | opq | 1/10/20 00:00:00 | null |
+-----+------------+----------------+------------------+------------------+
From that table I want to build a Type-2 dimension (tracking when changes to a record start and stop) using only the values of check_cols, as shown below. I'm looking for a pure SQL solution with no loops.
check_cols is actually made up of several columns, but I'll use an md5 hash of them to detect changes. Because my dimension only cares about check_cols, the timestamped records aren't quite what I need in some cases: for example, when a value in dont_care_cols changes but none of the check_cols values do.
Given the data above, I want the result set below:
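The md5-over-several-columns idea can be sketched like this (Python for illustration; the `check_hash` name, the `'|'` delimiter, and the NULL sentinel are my assumptions, chosen because a plain concatenation would let ('ab','c') and ('a','bc') collide):

```python
import hashlib

def check_hash(*cols):
    """md5 fingerprint of the tracked columns; the '|' delimiter and
    '<NULL>' sentinel keep distinct column tuples from colliding."""
    joined = '|'.join('<NULL>' if c is None else str(c) for c in cols)
    return hashlib.md5(joined.encode('utf-8')).hexdigest()

# rows that differ only in don't-care columns get identical fingerprints,
# so comparing hashes is enough to detect a check_cols change
print(check_hash('abc', 42) == check_hash('abc', 42))  # True
print(check_hash('abc', 42) == check_hash('abd', 42))  # False
```

In the warehouse the same thing is usually done inline, e.g. `MD5(col_a || '|' || col_b)` in the SELECT, with the hash compared instead of each column individually.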
+-----+------------+------------------+------------------+
| PK | Check_cols | start_ts | end_ts |
+-----+------------+------------------+------------------+
| 123 | abc | 1/1/20 00:10:00 | 1/5/20 12:15:00 |
| 123 | abd | 1/5/20 12:15:00 | 1/9/20 15:16:00 |
| 123 | abc | 1/9/20 15:16:00 | null |
| 456 | ghi | 1/2/20 03:45:00 | 1/10/20 00:00:00 |
| 456 | lmn | 1/10/20 00:00:00 | null |
+-----+------------+------------------+------------------+
I've tried using window functions to compare lead and lag values, take mins and maxes, and so on, but I can't crack the edge case that PK 123 shows in the first table, where the same check_cols value ('abc') reappears after an intervening change. I haven't found a solution via Google/Stack Overflow/etc. either; most approaches rely on daily snapshot runs. I want to be able to rebuild the target table whenever the logic changes. Does anyone have ideas?
我不知道这是否是最佳答案,或者它是否解决了您的所有用例,但请尝试一下,让我知道是否存在偶然发现的边缘情况。这有点骇人听闻。另外,我确实在用例中添加了一些记录:
CREATE OR REPLACE TEMP TABLE tran_data (pk int, check_cols varchar, dont_care_cols varchar, start_ts timestamp, end_ts timestamp);
INSERT INTO tran_data
SELECT *
FROM (VALUES(123,'abc','def',TO_TIMESTAMP('1/1/20 00:10:00','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/2/20 13:13:23','MM/DD/YY hh:mi:ss')),
(123,'abc','dhf',TO_TIMESTAMP('1/2/20 13:13:23','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/3/20 04:21:00','MM/DD/YY hh:mi:ss')),
(123,'abc','dhz',TO_TIMESTAMP('1/3/20 04:21:00','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/5/20 12:15:00','MM/DD/YY hh:mi:ss')),
(123,'abd','dyz',TO_TIMESTAMP('1/5/20 12:15:00','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/9/20 15:16:00','MM/DD/YY hh:mi:ss')),
(123,'abd','dyz',TO_TIMESTAMP('1/9/20 15:16:00','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/11/20 14:14:00','MM/DD/YY hh:mi:ss')),
(123,'abc','dyz',TO_TIMESTAMP('1/11/20 14:14:00','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/14/20 09:14:00','MM/DD/YY hh:mi:ss')),
(123,'abc','dyz',TO_TIMESTAMP('1/14/20 09:14:00','MM/DD/YY hh:mi:ss'),null),
(456,'ghi','jkl',TO_TIMESTAMP('1/2/20 03:45:00','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/10/20 00:00:00','MM/DD/YY hh:mi:ss')),
(456,'lmn','opq',TO_TIMESTAMP('1/10/20 00:00:00','MM/DD/YY hh:mi:ss'),null)
);
From there, I tried to find a way to create "groups" that I hope holds up for all of your use cases:
SELECT DISTINCT
      PK
    , check_cols
    , FIRST_VALUE(start_ts) OVER (PARTITION BY PK, check_cols, group_num ORDER BY start_ts) as new_start_ts
    -- explicit frame: without it, some engines evaluate LAST_VALUE over
    -- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW and return the current row's end_ts
    , LAST_VALUE(end_ts) OVER (PARTITION BY PK, check_cols, group_num ORDER BY start_ts
                               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as new_end_ts
FROM (
    SELECT
          ROW_NUMBER() OVER (PARTITION BY PK, check_cols ORDER BY start_ts) as group_cnt
        , group_cnt - pk_row as group_num
        , *
    FROM (
        SELECT
              ROW_NUMBER() OVER (PARTITION BY PK ORDER BY start_ts) as pk_row
            , *
        FROM tran_data
        )
    )
ORDER BY pk, new_start_ts;
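The same gaps-and-islands trick (two ROW_NUMBERs whose difference is constant within a contiguous run of identical check_cols) can be demonstrated end to end with Python's built-in sqlite3 module, assuming SQLite 3.25+ for window functions. This sketch uses GROUP BY with MIN/MAX instead of FIRST_VALUE/LAST_VALUE, and ISO-formatted timestamp strings so they sort correctly:

```python
import sqlite3

# the question's original sample rows (dont_care_cols omitted; it never affects the result)
rows = [
    (123, 'abc', '2020-01-01 00:10:00', '2020-01-02 13:13:23'),
    (123, 'abc', '2020-01-02 13:13:23', '2020-01-03 04:21:00'),
    (123, 'abc', '2020-01-03 04:21:00', '2020-01-05 12:15:00'),
    (123, 'abd', '2020-01-05 12:15:00', '2020-01-09 15:16:00'),
    (123, 'abc', '2020-01-09 15:16:00', None),
    (456, 'ghi', '2020-01-02 03:45:00', '2020-01-10 00:00:00'),
    (456, 'lmn', '2020-01-10 00:00:00', None),
]

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE tran_data (pk INT, check_cols TEXT, start_ts TEXT, end_ts TEXT)")
con.executemany("INSERT INTO tran_data VALUES (?, ?, ?, ?)", rows)

sql = """
WITH numbered AS (
    SELECT pk, check_cols, start_ts, end_ts,
           -- the difference of the two row numbers is constant within one
           -- contiguous run ('island') of the same check_cols value
           ROW_NUMBER() OVER (PARTITION BY pk ORDER BY start_ts)
         - ROW_NUMBER() OVER (PARTITION BY pk, check_cols ORDER BY start_ts) AS group_num
    FROM tran_data
)
SELECT pk, check_cols,
       MIN(start_ts) AS new_start_ts,
       -- MAX() skips NULLs, so force NULL when the island is still open-ended
       CASE WHEN COUNT(end_ts) < COUNT(*) THEN NULL ELSE MAX(end_ts) END AS new_end_ts
FROM numbered
GROUP BY pk, check_cols, group_num
ORDER BY pk, new_start_ts
"""
result = con.execute(sql).fetchall()
# pk 123 collapses to three rows (abc, abd, then abc again), pk 456 to two
for row in result:
    print(row)
```

The reappearance of 'abc' for PK 123 lands in its own group here, which is exactly the edge case that a plain PARTITION BY pk, check_cols misses.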