Pyspark cannot reach Hive

Question

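TL;DR: I have a working Hive on HDP3 that I cannot reach from pyspark, running under YARN (on the same HDP). How do I get pyspark to find my tables?

spark.catalog.listDatabases() only shows default, and none of the queries I run show up in my Hive logs.

This is my code, using Spark 2.3.1: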

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
settings = []
conf = SparkConf().setAppName("Guillaume is here").setAll(settings)
spark = (
    SparkSession
    .builder
    .master('yarn')
    .config(conf=conf)
    .enableHiveSupport()
    .getOrCreate()
)
print(spark.catalog.listDatabases())

Note that settings is empty. I thought this would be sufficient, because in the logs I see

loading hive config file: file:/etc/spark2/3.0.1.0-187/0/hive-site.xml

and, more interestingly,

Registering function intersectgroups io.x.x.IntersectGroups

This is a UDF I wrote and added to Hive manually, which means some kind of connection is being made.

The only output I get (apart from logs) is:

[ Database(name=u'default', description=u'default database', locationUri=u'hdfs://HdfsNameService/apps/spark/warehouse')]

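I know that I am supposed to set spark.sql.warehouse.dir in settings. Whether I set it to the value I found in hive-site, to the path of the database I am interested in (it is not in the default location), or to its parent, nothing changes.

I added many other config options to settings (including the thrift URIs), with no change.

I have also read that I should copy hive-site.xml into the spark2 conf directory. I did that on every node of the cluster, with no change.

The command I run is: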

HDP_VERSION=3.0.1.0-187 PYTHONPATH=.:/usr/hdp/current/spark2-client/python/:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip SPARK_HOME=/usr/hdp/current/spark2-client HADOOP_USER_NAME=hive spark-submit --master yarn --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.0.1.0-187.zip --files /etc/hive/conf/hive-site.xml ./subjanal/anal.py

Answer 1

In HDP 3.x, you need to use the Hive Warehouse Connector, as described in the docs.
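
As a rough illustration of what that can look like from pyspark, here is a minimal sketch. The helper names hwc_settings and show_hive_databases are made up for this example. The property keys and the pyspark_llap calls are written as I remember them from the Hortonworks connector documentation, so treat them as assumptions and check them against the docs for your exact HDP version.

```python
def hwc_settings(jdbc_url, metastore_uri, staging_dir, llap_hosts, zk_quorum):
    """Return (key, value) pairs suitable for SparkConf().setAll(...).

    The keys are assumed from the HDP 3.x Hive Warehouse Connector docs;
    the values must come from your own cluster (for example hive-site.xml).
    """
    return [
        ("spark.sql.hive.hiveserver2.jdbc.url", jdbc_url),
        ("spark.datasource.hive.warehouse.metastoreUri", metastore_uri),
        ("spark.datasource.hive.warehouse.load.staging.dir", staging_dir),
        ("spark.hadoop.hive.llap.daemon.service.hosts", llap_hosts),
        ("spark.hadoop.hive.zookeeper.quorum", zk_quorum),
    ]


def show_hive_databases(spark):
    """List Hive databases through the connector instead of spark.catalog."""
    # Imported lazily: pyspark_llap is only importable when the connector's
    # pyspark_hwc zip has been passed to spark-submit via --py-files.
    from pyspark_llap import HiveWarehouseSession

    hive = HiveWarehouseSession.session(spark).build()
    hive.showDatabases().show()
    return hive
```

In the script from the question, the result of hwc_settings(...) would take the place of the empty settings list passed to SparkConf().setAll(settings), and show_hive_databases(spark) would replace the spark.catalog.listDatabases() call. The spark-submit command in the question already passes the connector jar with --jars and the Python zip with --py-files, so only the configuration and the session object should be missing.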

python

hive

hortonworks-data-platform

pyspark