Exceptions to the Data Lake immutability rule
Data Lake should be immutable:
It is important that all data put in the lake should have a clear
provenance in place and time. Every data item should have a clear
trace to what system it came from and when the data was produced. The
data lake thus contains a historical record. This might come from
feeding Domain Events into the lake, a natural fit with Event Sourced
systems. But it could also come from systems doing a regular dump of
current state into the lake - an approach that's valuable when the
source system doesn't have any temporal capabilities but you want a
temporal analysis of its data. A consequence of this is that data put
into the lake is immutable, an observation once stated cannot be
removed (although it may be refuted later), you should also expect
ContradictoryObservations.
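Are there any exceptions to this rule, where overwriting data in the Data Lake may be considered good practice? I suppose not, but some of my teammates understand it differently.
I think data provenance and traceability are needed in the case of a cumulative algorithm, to be able to reproduce the final state. What if the final state doesn't depend on previous results? Is somebody right if he says that Data Lake immutability (event sourcing) is needed only for cumulative algorithms?
For example, you have a daily full-load ingestion of tables A and B, and afterwards you calculate table C. If the user is interested only in the latest result of C, are there any reasons to keep the history (event sourcing based on date partitioning) of A, B and C?
Another concern may be ACID compliance: your files may be corrupted or partially written. But suppose we are discussing the case where the latest state of A and B can easily be restored from the source system.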
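Are there any exceptions to the rule, where it may be considered good practice to overwrite data in the Data Lake?
It is better not to overwrite data in the Data Lake. If some events were generated because of a bug or a mistake, new events that compensate for the earlier ones should be produced. That way the Data Lake records the entire event history, including the compensating events and any eventual reprocessing.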
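To make the compensating-event idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the `Event` and `EventLog` names, the `payment_recorded` event kind and the integer amounts are all hypothetical and do not come from any particular data lake product.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class Event:
    event_id: int
    kind: str                          # hypothetical event kind, e.g. "payment_recorded"
    amount: int
    compensates: Optional[int] = None  # id of the earlier event this one corrects


class EventLog:
    """Append-only log: events can be added, never changed or removed."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, kind: str, amount: int,
               compensates: Optional[int] = None) -> Event:
        event = Event(len(self._events) + 1, kind, amount, compensates)
        self._events.append(event)
        return event

    def compensate(self, bad: Event) -> Event:
        # Record a new event that cancels the earlier one instead of deleting it.
        return self.append("compensation", -bad.amount, compensates=bad.event_id)

    def history(self) -> Tuple[Event, ...]:
        return tuple(self._events)

    def total(self) -> int:
        # The final state is derived by replaying the whole history.
        return sum(e.amount for e in self._events)


log = EventLog()
log.append("payment_recorded", 100)
wrong = log.append("payment_recorded", 999)  # generated by a bug
log.compensate(wrong)                        # refuted, but still on record
log.append("payment_recorded", 50)           # corrected value after reprocessing
```

After these four appends the erroneous event is still visible in `log.history()`, while `log.total()` reflects only the corrected values.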
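I think that data provenance and traceability are needed in the case of a cumulative algorithm, to be able to reproduce the final state. What if the final state isn't dependent on previous results? Is somebody right if he says that Data Lake immutability (event sourcing) is needed only for cumulative algorithms?
The DataLake is the final destination for all relevant events. Not every event needs to be recorded in the DataLake: we usually distinguish between operational/communication events and business events. Business events recorded in the DataLake can be used for reprocessing, or for new features that depend on the event history. It is also possible to generate isolated events that do not depend on the event history and add them to the history. From this we can infer that producing a final state does not violate the immutability principle: for a set of immutable events that are consecutive in time, we can always produce a final state. So the answer is no, it is not only for cumulative algorithms.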
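For example, you have a daily full-load ingestion of tables A and B, and afterwards calculate table C. If the user is interested only in the latest result of C, are there any reasons to keep the history (event sourcing based on date partitioning) of A, B and C?
The starting event of an event history cannot be reproduced. Only after the first event can we talk about a final state. In this particular case, the tuples and aggregates of A and B should not be treated as events, but as inputs to the calculation function. The calculation function's input should be recorded in the DataLake as a business event. The last event X (the calculation input) produces event Y. If event X is not recorded in the event history, Y has to be considered a starting event.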
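One way to picture this for the A, B and C example is the sketch below. It is a rough illustration, not a prescribed layout: the `dt=YYYY-MM-DD` partition naming, the JSON files, the helper names (`write_once`, `ingest`, `compute_c`, `latest`) and the sum used as a stand-in calculation are all my assumptions. Each daily load lands in its own partition and is never overwritten, the calculation input (event X) is stored alongside the result (event Y), and a user who only cares about the latest C simply reads the newest partition.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def write_once(path: Path, payload: dict) -> None:
    """Create a new file; refuse to overwrite an existing one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("x") as fh:  # mode "x" raises FileExistsError if the file exists
        json.dump(payload, fh)


def ingest(lake: Path, table: str, day: date, rows: list) -> Path:
    # Each daily full load goes to its own date partition (assumed layout).
    path = lake / table / f"dt={day.isoformat()}" / "data.json"
    write_once(path, {"rows": rows})
    return path


def compute_c(lake: Path, day: date, a_path: Path, b_path: Path) -> Path:
    a = json.loads(a_path.read_text())["rows"]
    b = json.loads(b_path.read_text())["rows"]
    result = sum(a) + sum(b)  # stand-in for the real calculation
    # Event X: the calculation input, recorded as a business event.
    x = {"inputs": [a_path.relative_to(lake).as_posix(),
                    b_path.relative_to(lake).as_posix()]}
    # Event Y: the result, stored together with a reference to X.
    path = lake / "C" / f"dt={day.isoformat()}" / "data.json"
    write_once(path, {"input_event": x, "result": result})
    return path


def latest(lake: Path, table: str) -> dict:
    # ISO dates sort lexicographically, so max() picks the newest partition.
    newest = max((lake / table).iterdir(), key=lambda p: p.name)
    return json.loads((newest / "data.json").read_text())


lake = Path(tempfile.mkdtemp())  # throwaway directory standing in for the lake
loads = [
    (date(2024, 1, 1), [1, 2], [3]),
    (date(2024, 1, 2), [10], [20, 30]),
]
for day, a_rows, b_rows in loads:
    a_path = ingest(lake, "A", day, a_rows)
    b_path = ingest(lake, "B", day, b_rows)
    compute_c(lake, day, a_path, b_path)

latest_c = latest(lake, "C")
```

Here `latest_c` holds only the newest result, yet it still points back to the exact A and B partitions it was computed from, so Y is not an unexplained starting event.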