Indexes for Databricks (Spark SQL) tables

indexing
apache-spark-sql
databricks
delta-lake

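I'm wondering how indexes work in Databricks. Can partitioning be considered a kind of index, since it effectively organizes the data into grouped subcategories?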

Yes, partitioning can be seen as a kind of index: it lets you jump directly to the necessary data without reading the whole dataset.
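To make the "jump directly to the data" idea concrete, here is a minimal plain-Python sketch of partition pruning. It is only a conceptual model, not Spark or Delta code: the rows, the `country` partition column, and the `read_where_country` helper are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows; "country" plays the role of the partition column.
rows = [
    {"country": "DE", "amount": 10},
    {"country": "US", "amount": 25},
    {"country": "DE", "amount": 7},
    {"country": "FR", "amount": 3},
]

# "Writing" the table: group rows by partition value, similar to how a
# partitioned table keeps each value in its own directory (country=DE/ ...).
partitions = defaultdict(list)
for row in rows:
    partitions[row["country"]].append(row)


def read_where_country(value):
    """Return the matching rows and how many partitions had to be opened."""
    # A filter on the partition column becomes a single lookup;
    # the other partitions are never read.
    selected = partitions.get(value, [])
    partitions_read = 1 if value in partitions else 0
    return selected, partitions_read


result, partitions_read = read_where_country("DE")
print(result)
print(f"read {partitions_read} of {len(partitions)} partitions")
```

The filter only helps when it is on the partition column; a filter on `amount` in this sketch would still have to look at every partition.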

Databricks Delta has another feature for this: data skipping. When writing data to Delta, the writer collects statistics (for example, min and max values) for the first N columns (32 by default) and writes those statistics into the Delta log. When we then filter by one of those indexed columns, we know whether a given file may contain the requested data or not. Another indexing technique for Databricks Delta is Bloom filtering, which tells you whether a specific value is definitely not in a file or might be in it.
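Both ideas can be sketched in plain Python. This is a toy model under stated assumptions, not how Delta implements them: the file names, the single `id` column, the `BloomFilter` class, and its size and hash-count parameters are all hypothetical.

```python
import hashlib

# Hypothetical data files, each holding the values of one column ("id").
files = {
    "part-0": [1, 5, 9],
    "part-1": [20, 22, 31],
    "part-2": [40, 47, 58],
}

# Statistics collected at write time: (min, max) per file.
stats = {name: (min(vals), max(vals)) for name, vals in files.items()}


def candidate_files(value):
    """Files whose [min, max] range could contain the value."""
    return [name for name, (lo, hi) in stats.items() if lo <= value <= hi]


class BloomFilter:
    """Toy Bloom filter: no false negatives, false positives are possible."""

    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


# One Bloom filter per file, built at write time.
blooms = {}
for name, vals in files.items():
    bloom = BloomFilter()
    for v in vals:
        bloom.add(v)
    blooms[name] = bloom


def files_to_scan(value):
    """Apply min/max skipping first, then the Bloom filter."""
    return [n for n in candidate_files(value) if blooms[n].might_contain(value)]


print(candidate_files(22))  # ['part-1']
print(candidate_files(35))  # []
print(files_to_scan(22))    # ['part-1']
```

Min/max statistics cannot rule out a value that falls inside a file's range but is not actually stored there (7 in `part-0`, for instance). That is the case where a Bloom filter may still let the reader skip the file, although it can also answer "possibly present" for a value that is absent.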

Update (April 14, 2022): starting with version 1.2.0, data skipping is also available in OSS Delta.
