Is it OK to use a Delta table tracker based on Parquet file names in Azure Databricks?

Today at work I saw a Delta Lake tracker based on file names. By delta tracker, I mean a function that determines whether a Parquet file has already been ingested.

The code checks which files (from the Delta table) have not yet been ingested. The Parquet files in the Delta table are then read using this: spark.createDataFrame(path, StringType())

Having worked with Delta tables, this does not seem like an appropriate way to build a delta tracker.

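If a record is deleted, what are the chances that the delta log points to a new file and that the deleted record is read as a new one?
If a record is updated, what are the chances that the delta log does not point to a new file and that the updated record is not picked up?
If some maintenance happens on the Delta table, what are the chances that new files are written out of nowhere, which may cause records to be re-ingested?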

Any observations or suggestions on whether this could work would be great. Thanks.

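In Delta Lake everything operates at the file level, so there are no 'in place' updates or deletes. Say a record gets deleted (or updated); then roughly the following happens:

The Parquet file containing the record in question is read in (along with any other records that happen to be in that file).
All records except the deleted one are written out to a new Parquet file.
The transaction log is updated with a new version that marks the old Parquet file as removed and the new Parquet file as added. Note that the old Parquet file is not physically deleted until you run the VACUUM command.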
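The steps above can be sketched with a toy model in plain Python. This is only an illustration of the copy-on-write idea, not how Delta Lake is implemented: the `storage` dictionary, the `log` list of commits, the helper names, and the file names are all hypothetical.

```python
# Toy model of a copy-on-write delete. Not the real Delta Lake implementation.
# `storage` stands in for the Parquet files on disk, `log` for the transaction log.

def active_files(log):
    """Replay the log: a file is active if it was added and not later removed."""
    active = set()
    for commit in log:
        for action, name in commit:
            if action == "add":
                active.add(name)
            else:
                active.discard(name)
    return active


def delete_record(storage, log, record):
    """Rewrite each active file that holds `record`, then commit a new version."""
    commit = []
    for name in sorted(active_files(log)):
        rows = storage[name]
        if record in rows:
            new_name = f"part-{len(storage):04d}.parquet"
            # Write every record except the deleted one to a new file.
            storage[new_name] = [r for r in rows if r != record]
            commit.append(("remove", name))
            commit.append(("add", new_name))
    log.append(commit)


storage = {"part-0000.parquet": ["a", "b", "c"]}
log = [[("add", "part-0000.parquet")]]

delete_record(storage, log, "b")

print(active_files(log))               # {'part-0001.parquet'}
print(storage["part-0001.parquet"])    # ['a', 'c']
print("part-0000.parquet" in storage)  # True: still on disk until a VACUUM-like cleanup
```

Note that a tracker keyed on file names would see `part-0001.parquet` as a brand-new file, even though records "a" and "c" were already ingested from the original file.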

The process for an update is essentially the same.

To answer your questions more specifically:

In case a record is deleted, what are the chances that the delta log would point to a new file, and that this deleted record would be read as a new one?

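The delta log will point to a new file, but the deleted record will not be in it. The new file will contain all the other records that happened to be in the original file.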

In case a record is updated, what would be the chance that the delta log would not point to a new file, and that this updated record would not be considered?

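Files are not updated in place, so this cannot happen. A new file is written that contains the updated record (plus any other records from the original file), and the transaction log is updated to 'point' to this new file.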

In case some maintenance is happening on the delta table, what are the chances that some new files are written out of nowhere, which may cause a record to be re-ingested?

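This is possible, although not 'out of nowhere'. For example, if you run OPTIMIZE, existing Parquet files get reshuffled/combined to improve performance. Basically this means a number of new Parquet files are written, and a new version in the transaction log points to these Parquet files. If you pick up all new files after this, you will re-ingest data.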
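To make the re-ingestion risk concrete, here is a minimal sketch in plain Python. It assumes a tracker that remembers only the names of files it has already ingested; the function `ingest_new_files`, the file names, and the record values are hypothetical, and the compaction is simulated by hand rather than by a real OPTIMIZE.

```python
# Toy model of a file-name tracker. Not real Delta Lake or Databricks code.

def ingest_new_files(active, ingested_names, sink):
    """Ingest every active file whose name the tracker has not seen before."""
    for name in sorted(active):
        if name not in ingested_names:
            sink.extend(active[name])
            ingested_names.add(name)


# Before compaction: two small files, both ingested once.
active = {"part-0000.parquet": [1, 2], "part-0001.parquet": [3]}
ingested_names, sink = set(), []
ingest_new_files(active, ingested_names, sink)

# Simulated compaction: the same records now live in one file with a new name.
active = {"part-0002.parquet": [1, 2, 3]}
ingest_new_files(active, ingested_names, sink)

print(sink)  # [1, 2, 3, 1, 2, 3]
```

No data changed between the two runs, yet every record ends up in the sink twice, because the tracker can only tell files apart by name.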

Some things to consider: if your Delta table is append-only, you can use Structured Streaming to read from it instead. If not, Databricks offers Change Data Feed, which gives you record-level details of inserts, updates and deletes.
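As a rough sketch of those two alternatives in PySpark: the helper names, `table_name`, and `starting_version` below are placeholders rather than anything from the original code, and the Change Data Feed read assumes the feed has already been enabled on the table (the `delta.enableChangeDataFeed` table property). An existing `SparkSession` is passed in as `spark`.

```python
def stream_appends(spark, table_name):
    """Append-only table: read it incrementally as a Structured Streaming source."""
    return spark.readStream.format("delta").table(table_name)


def read_changes(spark, table_name, starting_version):
    """Batch-read row-level changes via Change Data Feed.

    Assumes the table was set up with, for example:
    ALTER TABLE <name> SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """
    return (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", starting_version)
        .table(table_name)
    )
```

The DataFrame returned by `read_changes` carries extra metadata columns such as `_change_type` and `_commit_version`, so a downstream job can tell inserts, updates and deletes apart and record the last version it processed instead of tracking file names.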

azure

apache-spark

pyspark

azure-databricks

delta-lake