Data Modeling - Slowly Changing Dimension Type 2: How to deal with schema change (column added)?

Question

What is the best practice for handling schema changes when building a Slowly Changing Dimension (SCD) table?

For example, a column is added:

First state:
+----------+---------------------+-------------------+
|customerId|address              |updated_at         |
+----------+---------------------+-------------------+
|1         |current address for 1|2018-02-01 00:00:00|
+----------+---------------------+-------------------+


New state, with a new column added and every other tracked column unchanged:
+----------+---------------------+-------------------+------+
|customerId|address              |updated_at         |newCol|
+----------+---------------------+-------------------+------+
|1         |current address for 1|2018-03-03 00:00:00|1000  |
+----------+---------------------+-------------------+------+

My first approach was to treat a schema change as a row change, so I would add a new row to my SCD table:

+----------+---------------------+-------------------+------+-------------+-------------------+-------------------+
|customerId|address              |updated_at         |newCol|active_status|active_status_start|active_status_end  |
+----------+---------------------+-------------------+------+-------------+-------------------+-------------------+
|1         |current address for 1|2018-02-01 00:00:00|null  |false        |2018-02-01 00:00:00|2018-03-03 00:00:00|
|1         |current address for 1|2018-03-03 00:00:00|1000  |true         |2018-03-03 00:00:00|null               |
+----------+---------------------+-------------------+------+-------------+-------------------+-------------------+
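As a rough illustration of this first approach, here is a minimal sketch in plain Python, with SCD rows modelled as dicts. The helper name `add_version` is hypothetical, and the sketch assumes timestamps are already strings in the format shown in the tables. It closes the active version for the customer, backfills the added column with None on the old version, and appends the new active version.

```python
def add_version(scd_rows, incoming, key="customerId", ts="updated_at"):
    """Close the active version for incoming[key] and append a new active one.

    Hypothetical helper for illustration; returns a new list and leaves
    the input rows untouched.
    """
    out = []
    for row in scd_rows:
        # Backfill columns that exist only in the incoming row (e.g. newCol) with None.
        row = {**{c: None for c in incoming}, **row}
        if row[key] == incoming[key] and row["active_status"]:
            row["active_status"] = False
            row["active_status_end"] = incoming[ts]
        out.append(row)
    # The incoming row becomes the new open-ended version.
    out.append({**incoming, "active_status": True,
                "active_status_start": incoming[ts], "active_status_end": None})
    return out


scd = [{"customerId": 1, "address": "current address for 1",
        "updated_at": "2018-02-01 00:00:00", "active_status": True,
        "active_status_start": "2018-02-01 00:00:00", "active_status_end": None}]
incoming = {"customerId": 1, "address": "current address for 1",
            "updated_at": "2018-03-03 00:00:00", "newCol": 1000}

for row in add_version(scd, incoming):
    print(row["updated_at"], row["newCol"], row["active_status"], row["active_status_end"])
```

With the sample data, the two printed versions should correspond to the two rows of the table above.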

But what if the column is added and, for some specific rows, its value is null? For example, for the row with customerId = 2 it is null:

+----------+---------------------+-------------------+------+
|customerId|address              |updated_at         |newCol|
+----------+---------------------+-------------------+------+
|2         |current address for 2|2018-03-03 00:00:00|null  |
+----------+---------------------+-------------------+------+

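In this case, I can take one of two approaches:

1. Treat every schema change as a row change, even for rows where the new column is null (easier to implement, but more costly from a storage perspective). This results in: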

+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|customerId|address              |updated_at         |active_status|active_status_end  |active_status_start|newCol|
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|1         |current address for 1|2018-02-01 00:00:00|false        |2018-03-03 00:00:00|2018-02-01 00:00:00|null  |
|1         |current address for 1|2018-03-03 00:00:00|true         |null               |2018-03-03 00:00:00|1000  |
|2         |current address for 2|2018-02-01 00:00:00|false        |2018-03-03 00:00:00|2018-02-01 00:00:00|null  |
|2         |current address for 2|2018-03-03 00:00:00|true         |null               |2018-03-03 00:00:00|null  |
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+

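2. Check each row: if it has an actual value for the new column, add a new version of the row; otherwise, do nothing for that row (I have not worked out an implementation yet, but it is considerably more complex and potentially error-prone). For row 2 (customerId = 2), the result in the SCD table would be 'row has not changed':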

+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|customerId|address              |updated_at         |active_status|active_status_end  |active_status_start|newCol|
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
|1         |current address for 1|2018-02-01 00:00:00|false        |2018-03-03 00:00:00|2018-02-01 00:00:00|null  |
|1         |current address for 1|2018-03-03 00:00:00|true         |null               |2018-03-03 00:00:00|1000  |
|2         |current address for 2|2018-02-01 00:00:00|true         |null               |2018-02-01 00:00:00|null  |
+----------+---------------------+-------------------+-------------+-------------------+-------------------+------+
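The two approaches can be compared side by side with a small sketch. This is only an illustration in plain Python (rows as dicts), not a Spark implementation. The names `has_changed`, `merge_snapshot`, `version_all` and `first_state` are hypothetical, and the sketch assumes that `updated_at` is not itself a tracked column and that a column missing from a row is equivalent to null.

```python
META = {"active_status", "active_status_start", "active_status_end"}


def has_changed(active, incoming, ignore=("updated_at",)):
    """True if any tracked column differs; a missing column counts as None."""
    cols = (set(active) | set(incoming)) - META - set(ignore)
    return any(active.get(c) != incoming.get(c) for c in cols)


def merge_snapshot(scd_rows, snapshot, version_all, key="customerId", ts="updated_at"):
    """version_all=True: approach 1, every key in the snapshot gets a new version.
    version_all=False: approach 2, only new keys and keys whose values changed."""
    active = {r[key]: r for r in scd_rows if r["active_status"]}
    to_version = {r[key]: r for r in snapshot
                  if version_all or r[key] not in active
                  or has_changed(active[r[key]], r)}
    new_cols = {c for r in snapshot for c in r}
    out = []
    for row in scd_rows:
        row = {**{c: None for c in new_cols}, **row}  # backfill added columns with None
        new = to_version.get(row[key])
        if row["active_status"] and new is not None:
            row["active_status"], row["active_status_end"] = False, new[ts]
        out.append(row)
    for new in to_version.values():
        out.append({**new, "active_status": True,
                    "active_status_start": new[ts], "active_status_end": None})
    return out


def first_state(cid):
    """Build the initial active row for a customer (sample data only)."""
    return {"customerId": cid, "address": f"current address for {cid}",
            "updated_at": "2018-02-01 00:00:00", "active_status": True,
            "active_status_start": "2018-02-01 00:00:00", "active_status_end": None}


scd = [first_state(1), first_state(2)]
snapshot = [
    {"customerId": 1, "address": "current address for 1",
     "updated_at": "2018-03-03 00:00:00", "newCol": 1000},
    {"customerId": 2, "address": "current address for 2",
     "updated_at": "2018-03-03 00:00:00", "newCol": None},
]

approach_1 = merge_snapshot(scd, snapshot, version_all=True)
approach_2 = merge_snapshot(scd, snapshot, version_all=False)
print(len(approach_1), len(approach_2))
```

With the sample data, approach 1 yields four rows and approach 2 yields three, matching the two tables above. Note that under this comparison a removed column would look like a change to null for every row that previously held a value; whether that should create new versions is a business decision.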

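The second approach seems more "correct", but am I right? Approach 1 is also simpler to implement. Approach 2 needs something more complex and has other trade-offs, for example: a) what if a column is removed rather than added? b) it is much more costly from a query perspective.

I have researched this problem and did not find this case addressed anywhere.

Is there a standard approach for it? What are the trade-offs? Is there another approach I am missing here?

Thanks, everyone.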

Answer 1

Thanks to @MarmiteBomber and @MatBailie for the comments. Based on your input, I ended up implementing the second option, because (a summary of your points):

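The second approach is the only one that makes sense.
The implementation is a result of business logic, not necessarily of standard practice. In our case, we did not need to distinguish between kinds of null, so the right thing to do was to represent known-nonexistent values and unknown values alike as null.
Be explicit.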

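The second approach also requires adding a check at write time (does the new column exist in the row?), but it saves complexity at query time and in storage. Since an SCD is "slow" and this case is rare (schema changes happen, but not "every day"), paying for the check at write time is better than paying for it at query time.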
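To illustrate the query side, here is a minimal sketch of a point-in-time lookup over the table produced by the second approach. It assumes a half-open validity interval (start inclusive, end exclusive) and timestamps stored as 'YYYY-MM-DD HH:MM:SS' strings, which compare correctly as plain strings. The helper name `as_of` is hypothetical.

```python
def as_of(scd_rows, customer_id, when, key="customerId"):
    """Return the version of customer_id that was valid at `when`, or None."""
    for row in scd_rows:
        start, end = row["active_status_start"], row["active_status_end"]
        # A null end marks the currently active, open-ended version.
        if row[key] == customer_id and start <= when and (end is None or when < end):
            return row
    return None


# The three rows of the approach 2 table, reduced to the columns the lookup needs.
scd = [
    {"customerId": 1, "newCol": None,
     "active_status_start": "2018-02-01 00:00:00", "active_status_end": "2018-03-03 00:00:00"},
    {"customerId": 1, "newCol": 1000,
     "active_status_start": "2018-03-03 00:00:00", "active_status_end": None},
    {"customerId": 2, "newCol": None,
     "active_status_start": "2018-02-01 00:00:00", "active_status_end": None},
]

print(as_of(scd, 1, "2018-02-15 00:00:00")["newCol"])
print(as_of(scd, 1, "2018-03-10 00:00:00")["newCol"])
print(as_of(scd, 2, "2018-03-10 00:00:00")["active_status_start"])
```

Customer 2 still resolves to its single original version after the schema change, so the lookup does not have to skip over versions that carry no real change.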

Tags: sql, data-modeling, data-warehouse, dimensional-modeling, apache-spark