Merge historical periods of a dimension entity into one
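I have a Slowly Changing Dimension type 2 whose rows are identical apart from the start and end dates. How do I write a clean SQL query that merges rows that are identical and have contiguous time periods?
Current data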
+-------------+---------------------+--------------+------------+
| DimensionID | DimensionAttribute  | RowStartDate | RowEndDate |
+-------------+---------------------+--------------+------------+
| 1           | SomeValue           | 2019-01-01   | 2019-01-31 |
| 1           | SomeValue           | 2019-02-01   | 2019-02-28 |
| 1           | AnotherValue        | 2019-03-01   | 2019-03-31 |
| 1           | SomeValue           | 2019-04-01   | 2019-04-30 |
| 1           | SomeValue           | 2019-05-01   | 2019-05-31 |
| 2           | SomethingElse       | 2019-01-01   | 2019-01-31 |
| 2           | SomethingElse       | 2019-02-01   | 2019-02-28 |
| 2           | SomethingElse       | 2019-03-01   | 2019-03-31 |
| 2           | CompletelyDifferent | 2019-04-01   | 2019-04-30 |
| 2           | SomethingElse       | 2019-05-01   | 2019-05-31 |
+-------------+---------------------+--------------+------------+
Result
+-------------+---------------------+--------------+------------+
| DimensionID | DimensionAttribute  | RowStartDate | RowEndDate |
+-------------+---------------------+--------------+------------+
| 1           | SomeValue           | 2019-01-01   | 2019-02-28 |
| 1           | AnotherValue        | 2019-03-01   | 2019-03-31 |
| 1           | SomeValue           | 2019-04-01   | 2019-05-31 |
| 2           | SomethingElse       | 2019-01-01   | 2019-03-31 |
| 2           | CompletelyDifferent | 2019-04-01   | 2019-04-30 |
| 2           | SomethingElse       | 2019-05-01   | 2019-05-31 |
+-------------+---------------------+--------------+------------+
For this version of the problem, I would use lag() to determine where the groups start, then a cumulative sum and aggregation:
select DimensionID, DimensionAttribute,
       min(RowStartDate) as RowStartDate, max(RowEndDate) as RowEndDate
from (select t.*,
             -- running count of group starts: a row starts a new group
             -- unless it begins the day after the previous row ended
             sum(case when prev_red = dateadd(day, -1, RowStartDate)
                      then 0 else 1
                 end) over (partition by DimensionID, DimensionAttribute order by RowStartDate) as grp
      from (select t.*,
                   -- end date of the previous row with the same id and attribute
                   lag(RowEndDate) over (partition by DimensionID, DimensionAttribute order by RowStartDate) as prev_red
            from t
           ) t
     ) t
group by DimensionID, DimensionAttribute, grp;
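In particular, this recognizes gaps between the rows. It only combines rows when they fit together exactly, that is, when the previous end date is one day before the start date. This can be tweaked, of course, to allow a gap of 1 or 2 days or to allow overlaps.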
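The tolerance tweak mentioned above might look roughly like this. It is only a sketch under assumptions: SQL Server syntax (as dateadd() suggests), the column names from the question's table, the table name t from the query above, and a hypothetical variable @max_gap_days that does not appear in the original answer.

```sql
-- Hypothetical tolerance: merge rows separated by at most this many missing days.
declare @max_gap_days int = 2;

select DimensionID, DimensionAttribute,
       min(RowStartDate) as RowStartDate, max(RowEndDate) as RowEndDate
from (select t.*,
             -- A row continues the current group when the previous row ended
             -- no earlier than (1 + @max_gap_days) days before this row starts.
             -- Using ">=" instead of "=" also accepts overlapping rows.
             sum(case when prev_red >= dateadd(day, -1 - @max_gap_days, RowStartDate)
                      then 0 else 1
                 end) over (partition by DimensionID, DimensionAttribute
                            order by RowStartDate) as grp
      from (select t.*,
                   lag(RowEndDate) over (partition by DimensionID, DimensionAttribute
                                         order by RowStartDate) as prev_red
            from t
           ) t
     ) t
group by DimensionID, DimensionAttribute, grp;
```

On the sample data the output should be unchanged, because the only gaps are a full month long. Note that lag() compares each row only with its immediate predecessor, so if one row can be entirely contained within an earlier, longer row, a running maximum of the end date would probably be needed instead.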