Performance considerations when reading from a Hive view vs. a Hive table via DataFrames

Question

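We have a view that unions multiple Hive tables. If I use Spark SQL in PySpark and read that view, will there be any performance issue compared with reading directly from the table? In Hive, if we don't restrict the WHERE clause to the exact table partition, we get what is called a full table scan. Is Spark smart enough to read directly from the table that contains the data we are looking for, rather than searching through the entire view? Please advise.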

Answer 1

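You are talking about partition pruning. Yes, Spark supports it: when partition filters are specified, Spark automatically skips reading the partitions that do not match.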

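Partition pruning is possible when the data in a table is split across multiple logical partitions. Each partition corresponds to a particular value of a partition column and is stored as a subdirectory under the table's root directory on HDFS. Where applicable, only the required partitions (subdirectories) of the table are queried, which avoids unnecessary I/O.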
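The directory layout described above can be illustrated with a small standard-library sketch. This is not Spark's implementation, only a toy model of the idea: rows are written into Hive-style `column=value` subdirectories, and a filter on the partition column selects directories by name before any file is opened. The helper names (`write_partitioned`, `prune_partitions`, `read_pruned`, `demo`) and the sample rows are hypothetical.

```python
import os
import tempfile


def write_partitioned(root, rows, partition_col):
    """Write each row into a Hive-style subdirectory: <root>/<col>=<value>/part-0.txt."""
    for row in rows:
        part_dir = os.path.join(root, f"{partition_col}={row[partition_col]}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-0.txt"), "a") as fh:
            fh.write(row["name"] + "\n")


def prune_partitions(root, partition_col, wanted_value):
    """Return only the subdirectories whose partition value matches the filter."""
    selected = []
    for entry in sorted(os.listdir(root)):
        col, _, value = entry.partition("=")
        if col == partition_col and value == str(wanted_value):
            selected.append(entry)
    return selected


def read_pruned(root, partition_col, wanted_value):
    """Read names from matching partitions only; other directories are never opened."""
    names = []
    for part in prune_partitions(root, partition_col, wanted_value):
        with open(os.path.join(root, part, "part-0.txt")) as fh:
            names.extend(line.strip() for line in fh)
    return names


def demo():
    # Hypothetical sample data, partitioned by age
    rows = [
        {"name": "alice", "age": 20},
        {"name": "bob", "age": 30},
        {"name": "carol", "age": 20},
    ]
    with tempfile.TemporaryDirectory() as root:
        write_partitioned(root, rows, "age")
        return prune_partitions(root, "age", 20), read_pruned(root, "age", 20)
```

Calling `demo()` returns the single selected directory name and the names stored in it; the `age=30` directory is never read, which is the I/O saving that pruning provides.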

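Once the data is partitioned, subsequent queries can skip large amounts of I/O when a predicate references the partition column. For example, the following query automatically locates and loads the files under peoplePartitioned/age=20/ and ignores all others: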

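// Load the ORC data; Spark discovers the age=... subdirectories as partitions
val peoplePartitioned = spark.read.format("orc").load("peoplePartitioned")
// Register the DataFrame so it can be queried with SQL
peoplePartitioned.createOrReplaceTempView("peoplePartitioned")
// The filter on the partition column lets Spark read only peoplePartitioned/age=20/
spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20")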

More detailed information is provided here

If you run explain(True) on the query, you can also see this in the query plan:

spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20").explain(True)

It will show which partitions Spark reads.

hive

apache-spark

apache-spark-sql

pyspark

pyspark-sql