How to integrate Hive access into PySpark derived from pip and conda (not from a Spark distribution or package)

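I build and programmatically use my PySpark environment from the ground up via conda and pip pyspark (as I demonstrate Here), rather than using the PySpark that ships with the downloadable Spark distribution. As you can see in the first code snippet at the URL above, I accomplish this through (among other things) k/v conf pairs in my SparkSession startup script. (By the way, this approach lets me work in various REPLs, IDEs, and Jupyter.)

However, on configuring Spark to access Hive databases and the metastore, the manual says this: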

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.

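The conf/ above refers to the conf/ directory in the Spark distribution package. But pyspark installed via pip and conda of course doesn't have that directory, so how can Hive database and metastore support be plugged into Spark in that case?

I suspect this might be accommodated by specially prefixed SparkConf k/v pairs of the form spark.hadoop.* (see here). If so, I still need to determine which Hadoop, Hive, and core directives are required. I guess I'll work that out by trial and error. :)

Note: .enableHiveSupport() is already included.

I'll tinker with the spark.hadoop.* k/v pairs, but if anyone knows offhand how this is done, please let me know.

Thanks. :)

EDIT: After the solution was provided, I updated the content at the first URL above. It now integrates the SPARK_CONF_DIR and HADOOP_CONF_DIR environment variable approach discussed below.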

In this scenario I'd recommend the official configuration guide (emphasis mine):

If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark’s classpath:

hdfs-site.xml, which provides default behaviors for the HDFS client.

core-site.xml, which sets the default filesystem name.

(...)

To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to a location containing the configuration files.

Additionally:

To specify a different configuration directory other than the default “SPARK_HOME/conf”, you can set SPARK_CONF_DIR. Spark will use the configuration files (spark-defaults.conf, spark-env.sh, log4j.properties, etc) from this directory.

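So you can place the required configuration files in any directory that your Spark installation can access, and SPARK_CONF_DIR and/or HADOOP_CONF_DIR can easily be set directly in your script using os.environ.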
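A minimal sketch of the environment variable approach, assuming a single directory holds hive-site.xml, core-site.xml and hdfs-site.xml. The helper names `point_spark_at_conf_dir` and `build_hive_session`, and the default app name, are hypothetical and only for illustration. The variables need to be set before the session (and the JVM behind it) is created.

```python
import os
from pathlib import Path


def point_spark_at_conf_dir(conf_dir):
    """Export SPARK_CONF_DIR and HADOOP_CONF_DIR for a pip/conda PySpark.

    conf_dir is assumed to contain hive-site.xml, core-site.xml and
    hdfs-site.xml (and optionally spark-defaults.conf).
    """
    conf_path = Path(conf_dir).expanduser().resolve()
    if not conf_path.is_dir():
        raise FileNotFoundError(f"configuration directory not found: {conf_path}")
    # Both variables point at the same directory in this sketch;
    # they can just as well refer to two different locations.
    os.environ["SPARK_CONF_DIR"] = str(conf_path)
    os.environ["HADOOP_CONF_DIR"] = str(conf_path)
    return conf_path


def build_hive_session(conf_dir, app_name="hive-demo"):
    """Create a Hive-enabled SparkSession after exporting the variables."""
    point_spark_at_conf_dir(conf_dir)
    # Imported here so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()
        .getOrCreate()
    )
```

Calling `build_hive_session("~/spark-conf")` (a hypothetical path) from a REPL, IDE or notebook would then start a session that reads its configuration from that directory. If a session is already running in the process, `getOrCreate()` returns it unchanged, so the variables have to be set before the first session is started.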

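Finally, most of the time there is no need for separate Hadoop configuration files at all, because Hadoop-specific properties can be set directly in the Spark configuration using the spark.hadoop.* prefix.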
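As a rough sketch of the spark.hadoop.* route: the property names `hive.metastore.uris` and `fs.defaultFS` are standard Hive and Hadoop settings, but the host names and ports below are placeholders, and the helper names `as_spark_hadoop_conf` and `build_session` are hypothetical. Which properties a given cluster actually needs is not stated here and has to be determined per environment.

```python
def as_spark_hadoop_conf(hadoop_props):
    """Prefix plain Hadoop/Hive property names with 'spark.hadoop.'."""
    return {f"spark.hadoop.{key}": str(value) for key, value in hadoop_props.items()}


# Placeholder values: replace with the endpoints of your own cluster.
EXAMPLE_PROPS = {
    "hive.metastore.uris": "thrift://metastore.example.com:9083",
    "fs.defaultFS": "hdfs://namenode.example.com:8020",
}


def build_session(hadoop_props, app_name="hive-demo"):
    """Create a Hive-enabled SparkSession configured only through k/v pairs."""
    # Imported here so the mapping helper works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in as_spark_hadoop_conf(hadoop_props).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With this in place, `build_session(EXAMPLE_PROPS)` would pass each property to Spark as `spark.hadoop.<name>`, so no hive-site.xml or core-site.xml is needed on disk.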

Tags: python, hive, apache-spark, pyspark, hive-metastore