Run both Databricks Optimize and Vacuum?

Does it make sense to call both Databricks (Delta) Optimize and Vacuum? It seems to make sense, but I don't want to just infer what to do, so I'd rather ask.

Vacuum

Recursively vacuum directories associated with the Delta table and remove data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. Files are deleted according to the time they have been logically removed from Delta’s transaction log + retention hours, not their modification timestamps on the storage system. The default threshold is 7 days.

Optimize

Optimizes the layout of Delta Lake data. Optionally optimize a subset of data or colocate data by column. If you do not specify colocation, bin-packing optimization is performed.

Second question: if the answer is yes, which order of operations is best?

  1. Optimize, then Vacuum
  2. Vacuum, then Optimize

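Yes, you need to run both. OPTIMIZE compacts data by writing new, larger files, but the files it replaces stay on storage until VACUUM deletes them, so you need VACUUM at least to clean up after OPTIMIZE. With the default settings the order doesn't matter, because VACUUM only deletes files that were removed from the transaction log more than 7 days ago. The order only matters if you run VACUUM with a retention of 0 seconds, but that isn't recommended anyway, because it removes the whole history (you lose the ability to time travel to older versions).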
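As a rough illustration, here is a minimal sketch of a maintenance step that issues both commands. The helper name `optimize_then_vacuum` and the table name `events` are made up for this example; `spark` is assumed to be an active SparkSession on a cluster with Delta Lake available. The 168 hours simply spells out the 7-day default explicitly.

```python
def optimize_then_vacuum(spark, table_name, retention_hours=168):
    """Compact a Delta table, then delete unreferenced files past the retention threshold.

    `spark` is anything with a .sql(str) method, normally a SparkSession.
    168 hours equals the 7-day default retention mentioned above.
    """
    statements = [
        # Bin-packing compaction: rewrites small files into larger ones.
        f"OPTIMIZE {table_name}",
        # Removes files that left the transaction log more than
        # `retention_hours` ago, including files replaced by earlier OPTIMIZE runs.
        f"VACUUM {table_name} RETAIN {retention_hours} HOURS",
    ]
    for statement in statements:
        spark.sql(statement)
    return statements

# Hypothetical usage in a notebook, where `spark` already exists:
# optimize_then_vacuum(spark, "events")
```

Swapping the two statements would give the same result with this retention, since the files replaced by today's OPTIMIZE are not eligible for deletion until the threshold has passed.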