Is there a difference between reading a Spark dataset using a filter vs. basePath + full filter path?

Question

For a dataset partitioned by some column, is there any difference in read efficiency between:

// (1) read all dataset then filter
spark.read.parquet("/root/path").filter(col("mycolumn") === 42)

and

// (2) read directly the required data subset
spark.read.option("basePath", "/root/path").parquet("/root/path/mycolumn=42")

?

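I am asking in a context where the data files are not stored on the same cluster as Spark (so there is no data locality). In particular, I would like to know whether, in case (1), Spark retrieves the complete dataset files onto the Spark cluster and then filters them (hopefully without actually reading the files), or whether the filtering is done before the files are retrieved, which is what I expect case (2) to do.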

Answer 1

There is a big difference.

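In the first case you read all the files and then filter; in the second case you read only the selected files (the filtering has already been done by the partitioning).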

You can use the explain() function to check whether the filter is applied as a predicate pushdown. In the FileScan parquet node of the plan you will see PushedFilters and PartitionFilters.

In your case, you should read the partition's data directly, without a filter.

spark.read.option("basePath", "/root/path").parquet("/root/path/mycolumn=42")
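
To see what explain() reports for each variant, here is a minimal sketch. It assumes a local SparkSession and a small throwaway dataset written to a temporary directory; the object name, the scratch path, and the sample rows are hypothetical stand-ins for the real /root/path. If the mycolumn predicate shows up under PartitionFilters in the plan for variant (1), Spark is selecting partition directories from the predicate rather than filtering rows after reading them.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PartitionPruningCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to inspect the query plans.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partition-pruning-check")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical scratch location standing in for /root/path.
    val root = Files.createTempDirectory("pruning-demo").resolve("data").toString

    // Write a tiny dataset partitioned by mycolumn, which produces
    // directories such as <root>/mycolumn=42/.
    Seq((1, 41), (2, 42), (3, 42), (4, 43))
      .toDF("id", "mycolumn")
      .write
      .partitionBy("mycolumn")
      .parquet(root)

    // (1) read from the root path, then filter on the partition column
    val filtered = spark.read.parquet(root).filter(col("mycolumn") === 42)

    // (2) read only the partition directory, keeping mycolumn via basePath
    val direct = spark.read.option("basePath", root).parquet(s"$root/mycolumn=42")

    // Compare the FileScan parquet nodes: look at PartitionFilters and PushedFilters.
    filtered.explain()
    direct.explain()

    // Both variants should return the same rows.
    assert(filtered.count() == direct.count())

    spark.stop()
  }
}
```

The plan output differs between Spark versions, so compare the two FileScan nodes on your own cluster rather than relying on a fixed expected string.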
