Is it possible to reduce the number of MetaStore checks when querying a Hive table with lots of columns?

Question

I am using Spark SQL on Databricks, which uses a Hive metastore, and I am trying to set up a job/query that uses quite a lot of columns (20+).

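The time it takes to run the metastore validation checks scales linearly with the number of columns included in my query. Is there any way to skip this step? Or to pre-compute the checks? Or at least to make the metastore check once per table rather than once per column?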

A small example: when I run the line below, the metastore check happens once, even before display or collect is called:

new_table = table.withColumn("new_col1", F.col("col1"))

When I run the code below, the metastore check happens multiple times, so it takes longer:

new_table = (table
    .withColumn("new_col1", F.col("col1"))
    .withColumn("new_col2", F.col("col2"))
    .withColumn("new_col3", F.col("col3"))
    .withColumn("new_col4", F.col("col4"))
    .withColumn("new_col5", F.col("col5"))
)

In the driver node logs, the metastore checks look like this:

20/01/09 11:29:24 INFO HiveMetaStore: 6: get_database: xxx
20/01/09 11:29:24 INFO audit: ugi=root    ip=unknown-ip-addr    cmd=get_database: xxx

What the user sees on Databricks is:

Performing Hive catalog operation: databaseExists
Performing Hive catalog operation: tableExists
Performing Hive catalog operation: getRawTable
Running command...

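I would like to know whether anyone can confirm that this is how it works (one metastore check per column), and whether I simply have to plan for the overhead of the metastore checks.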

Answer 1

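I am surprised by this behavior, because it does not fit the Spark processing model, and I cannot reproduce it in Scala. It may be somehow specific to PySpark, but I doubt it, since PySpark is just an API for creating Spark plans.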

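What does happen, however, is that the plan is analyzed after every withColumn(...) call. If the plan is large, this can take a while. There is a simple optimization, though: replace multiple withColumn(...) calls on independent columns with df.select(F.col("*"), F.col("col2").alias("new_col2"), ...). In that case, the analysis is performed only once.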

For some extremely large plans, this has saved us more than 10 minutes of analysis time for a single notebook cell.
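
As a rough illustration of the single-select rewrite described above, here is a minimal sketch. The helper name with_columns_single_select and its new_to_source argument are hypothetical (they are not part of PySpark); the sketch only relies on DataFrame.select, DataFrame indexing by column name, and Column.alias, and it assumes the new columns are independent of each other.

```python
from typing import TYPE_CHECKING, Mapping

if TYPE_CHECKING:  # used for type hints only; nothing is imported at runtime
    from pyspark.sql import DataFrame


def with_columns_single_select(
    df: "DataFrame", new_to_source: Mapping[str, str]
) -> "DataFrame":
    """Add aliased copies of existing columns with one select() call.

    Hypothetical helper, not a PySpark API. Each chained withColumn()
    returns a new DataFrame whose plan is analyzed again; building all
    the columns first and calling select() once avoids the repetition.
    """
    # One aliased Column per requested new column, in mapping order.
    new_columns = [
        df[source].alias(new) for new, source in new_to_source.items()
    ]
    # "*" keeps every existing column; the aliases are appended after them.
    return df.select("*", *new_columns)
```

With the table from the question, the five chained calls would become new_table = with_columns_single_select(table, {"new_col1": "col1", "new_col2": "col2", "new_col3": "col3", "new_col4": "col4", "new_col5": "col5"}). Whether this also reduces the number of metastore calls on a given Databricks runtime is something to verify in the driver logs rather than assume.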
