Modelling data from a single timestamp to a record with valid_from/valid_to timestamps when there is a change

Here is table1 with some sample data:

id date_column col1 col2
1 06/03/2021 NULL 1
1 07/03/2021 NULL 1
1 08/03/2021 1 1
1 09/03/2021 1 2
2 05/03/2021 1 1
2 09/03/2021 1 1

I want to convert it into the following format:

id valid_from valid_to col1 col2
1 06/03/2021 08/03/2021 NULL 1
1 08/03/2021 09/03/2021 1 1
1 09/03/2021 01/01/2100 1 2
2 05/03/2021 01/01/2100 1 1

So a new row in the desired format is created whenever there is a new value in col1 or col2.

valid_from is the earliest date_column value for a given combination of col1 and col2, and valid_to is the earliest date_column value at which either of those values changes.

I was able to achieve this transformation with the following SQL (Presto-specific):

WITH base AS (
SELECT
*
FROM (
  VALUES
    (1, date('2021-03-06'), NULL, 1),
    (1, date('2021-03-07'), NULL, 1),
    (1, date('2021-03-08'), 1, 1),
    (1, date('2021-03-09'), 1, 2),
    (2, date('2021-03-05'), 1, 1),
    (2, date('2021-03-09'), 1, 1)
) AS t (id, date_column, col1, col2)
)

, base2 AS (
SELECT
  id
, date_column
, col1
, col2
-- non-empty delimiter so that e.g. (1, 11) and (11, 1) do not produce the same string
, array_join(array[cast(col1 AS VARCHAR),
                   cast(col2 AS VARCHAR)], ',', 'null') AS col_dedup
FROM
  base
)

, base3 AS (
SELECT
  id
, date_column
, col1
, col2

, coalesce(
    lag(col_dedup) OVER (PARTITION BY id  ORDER BY date_column) = col_dedup, 
    false
) AS same_as_previous

from base2
)

SELECT
  id
, date_column                                                                          AS valid_from
, lead(date_column, 1, date('2100-01-01')) OVER (PARTITION BY id ORDER BY date_column) AS valid_to
, col1
, col2
FROM
  base3
WHERE
  same_as_previous = false
ORDER BY
  id
, date_column ASC

The difficulty is that when you have 100 columns, all 100 of them have to appear in the array_join.

Now the real question: is there a better way to do the above transformation?

This is a type of gaps-and-islands problem, but really a simple version of one. You want the first row of each grouping, then lead() to get the end date:

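select id, col1, col2, date_column as valid_from,
       lead(date_column, 1, date '2100-01-01') over (partition by id order by date_column) as valid_to
from (select t1.*,
             -- date of the previous row for this id
             lag(date_column) over (partition by id order by date_column) as prev_datecol,
             -- date of the previous row for this id that has the same col1, col2
             lag(date_column) over (partition by id, col1, col2 order by date_column) as prev_datecol_12
      from table1 t1
     ) t1
-- keep only rows that start a new run of (col1, col2) values
where prev_datecol_12 is null or
      (prev_datecol <> prev_datecol_12);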
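
As a quick check, the same idea can be run against the sample rows from the question by inlining them as a CTE. This is only a sketch: it assumes Presto/Trino syntax, uses the question's column name date_column, and casts the NULLs explicitly so the column type is unambiguous.

```sql
-- Sample rows from the question, inlined so the query runs on its own.
with table1 as (
    select *
    from (
        values
            (1, date '2021-03-06', cast(null as integer), 1),
            (1, date '2021-03-07', cast(null as integer), 1),
            (1, date '2021-03-08', 1, 1),
            (1, date '2021-03-09', 1, 2),
            (2, date '2021-03-05', 1, 1),
            (2, date '2021-03-09', 1, 1)
    ) as t (id, date_column, col1, col2)
)
select id,
       date_column as valid_from,
       -- next kept row for the same id, or the far-future sentinel
       lead(date_column, 1, date '2100-01-01')
           over (partition by id order by date_column) as valid_to,
       col1,
       col2
from (select t1.*,
             -- previous row for this id, whatever its values
             lag(date_column) over (partition by id order by date_column) as prev_datecol,
             -- previous row for this id with the same (col1, col2)
             lag(date_column) over (partition by id, col1, col2 order by date_column) as prev_datecol_12
      from table1 t1
     ) t1
where prev_datecol_12 is null
   or prev_datecol <> prev_datecol_12
order by id, valid_from;
```

On this data it should return the four rows of the desired output. The NULL rows for id 1 collapse into one because partition by puts NULLs in the same partition, so no string concatenation or placeholder value is needed.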

Note that this approach does not require aggregation, and avoiding aggregation is often faster.

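More importantly, this handles groups where the values return to a previous set of values. Your approach does not do that. I am guessing that is what you really want for this type of problem.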
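
To illustrate the case of values returning to an earlier set, here is a small sketch with made-up rows (they are hypothetical, not from the question) where (col1, col2) for one id goes (1, 1), then (1, 2), then back to (1, 1). It assumes Presto/Trino syntax.

```sql
-- Hypothetical rows: the values change and then return to the earlier combination.
with table1 as (
    select *
    from (
        values
            (1, date '2021-03-01', 1, 1),
            (1, date '2021-03-02', 1, 2),
            (1, date '2021-03-03', 1, 1),  -- back to (1, 1): starts a new period
            (1, date '2021-03-04', 1, 1)   -- unchanged: folded into the period above
    ) as t (id, date_column, col1, col2)
)
select id,
       date_column as valid_from,
       lead(date_column, 1, date '2100-01-01')
           over (partition by id order by date_column) as valid_to,
       col1,
       col2
from (select t1.*,
             lag(date_column) over (partition by id order by date_column) as prev_datecol,
             lag(date_column) over (partition by id, col1, col2 order by date_column) as prev_datecol_12
      from table1 t1
     ) t1
where prev_datecol_12 is null
   or prev_datecol <> prev_datecol_12
order by id, valid_from;
```

The row dated 2021-03-03 is kept because the previous row for the id (2021-03-02) is not the previous row with the same values (2021-03-01), so the result should have three periods: 2021-03-01 to 2021-03-02, 2021-03-02 to 2021-03-03, and 2021-03-03 to 2100-01-01.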