spark-submit: how to pass --driver-class-path in cluster mode?
If I start the pyspark shell with --driver-class-path, I can access Hive resources from my Docker image:
$ pyspark --driver-class-path /etc/spark2/conf:/etc/hive/conf
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
Using Python version 3.7.4 (default, Aug 13 2019 20:35:49)
SparkSession available as 'spark'.
>>> from pyspark.sql import SparkSession
>>>
>>> #declaration
... appName = "test_hive_minimal"
>>> master = "yarn"
>>>
... sc = SparkSession.builder \
... .appName(appName) \
... .master(master) \
... .enableHiveSupport() \
... .config("spark.hadoop.hive.enforce.bucketing", "True") \
... .config("spark.hadoop.hive.support.quoted.identifiers", "none") \
... .config("hive.exec.dynamic.partition", "True") \
... .config("hive.exec.dynamic.partition.mode", "nonstrict") \
... .getOrCreate()
>>> sql = "show tables in user_tables"
>>> df_new = sc.sql(sql)
20/08/20 15:08:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> df_new.show()
+-----------+--------------------+-----------+
| database| tableName|isTemporary|
+-----------+--------------------+-----------+
|user_tables| dummyt| false|
|user_tables|abcdefg...dummytable| false|
But if I run the same script through spark-submit, I get the following error:
spark-submit --master local --deploy-mode cluster --name test_hive --executor-memory 2g --num-executors 1 -- test_hive_minimal.py --verbose
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.7/site-packages/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/opt/conda/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/conda/lib/python3.7/site-packages/pyspark/sql/utils.py", line 71, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Database 'user_tables' not found;"
test_hive_minimal.py is a simple script that checks the Hive database:
from pyspark.sql import SparkSession
appName = "test_hive_minimal"
master = "yarn"
# Creating Spark session
sc = SparkSession.builder \
.appName(appName) \
.master(master) \
.enableHiveSupport() \
.config("spark.hadoop.hive.enforce.bucketing", "True") \
.config("spark.hadoop.hive.support.quoted.identifiers", "none") \
.config("hive.exec.dynamic.partition", "True") \
.config("hive.exec.dynamic.partition.mode", "nonstrict") \
.getOrCreate()
sql = "show tables in user_tables"
df_new = sc.sql(sql)
df_new.show()
sc.stop()
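I tried several approaches: passing hive.metastore.uris, passing spark.sql.warehouse.dir, and shipping the XML files via --files. Somehow my executors do not seem to have access to the configuration. Can anyone help?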
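Before shipping hive-site.xml with --files, it can help to confirm that the file actually carries the metastore settings. The following is a minimal sketch, assuming the usual Hadoop-style layout of `<property>` elements with `<name>` and `<value>` children; the helper name `read_hadoop_properties` and the inline sample are hypothetical, and the URI is the one that appears in the log further down.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Hypothetical sample; in practice read the text of /etc/hive/conf/hive-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://cluster01.cdh.com:9083</value>
  </property>
</configuration>"""

props = read_hadoop_properties(SAMPLE)
print(props.get("hive.metastore.uris"))
```

If `hive.metastore.uris` is missing from the result, shipping the file would not be expected to point Spark at the remote metastore.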
Update:
I managed to pass hive-site.xml via --files to spark-submit in cluster mode, and the logs show that it no longer creates a local derby.db for the metastore. However, I am now facing another problem:
20/08/21 09:59:29 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/08/21 09:59:31 INFO hive.HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
20/08/21 09:59:31 INFO hive.metastore: Trying to connect to metastore with URI thrift://cluster01.cdh.com:9083
20/08/21 09:59:32 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
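This looks like a Kerberos issue, but I already have a valid Kerberos ticket and can access HDFS from the terminal as well as through spark-shell from Docker. What needs to be done here? Isn't this configured automatically by YARN when submitting to the cluster?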
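To rule out the local side, the ticket cache on the submitting machine can be checked from Python. This is only a sketch, assuming an MIT Kerberos `klist` on the PATH, whose `-s` flag prints nothing and exits with status 0 when the cache is readable and unexpired; the function name `has_valid_tgt` is hypothetical. It inspects only the machine it runs on and says nothing about the node where the driver ends up.

```python
import subprocess


def has_valid_tgt():
    """Return True if `klist -s` reports a readable, unexpired ticket cache."""
    try:
        result = subprocess.run(
            ["klist", "-s"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # klist is not installed or cannot be executed on this machine.
        return False
    return result.returncode == 0


print("local TGT present:", has_valid_tgt())
```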
I think you should pass the keytab in the spark-submit command. Is this code being run over SSH?
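Update:
The issue was resolved after sharing a volume mount into the Docker container and passing the keytab/principal together with hive-site.xml so that the metastore can be reached.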
# The keytab is mounted into the Docker container and kept at a local path.
spark-submit --master yarn \
--deploy-mode cluster \
--jars /srv/python/ext_jars/terajdbc4.jar \
--files=/etc/hive/conf/hive-site.xml \
--keytab /home/alias/.kt/alias.keytab \
--principal alias@realm.com.org \
--name td_to_hive_test \
--driver-cores 2 \
--driver-memory 2G \
--num-executors 44 \
--executor-cores 5 \
--executor-memory 12g \
td_to_hive_test.py
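The same submission can be assembled from Python, which makes it easy to check that the keytab and hive-site.xml exist inside the container before calling spark-submit. This is a sketch rather than a definitive wrapper: the helper names `build_submit_command` and `missing_files` are hypothetical, only the flags already shown above are used, and the resource-sizing options are left out for brevity.

```python
import os
import shlex


def build_submit_command(script, keytab, principal, hive_site,
                         master="yarn", deploy_mode="cluster"):
    """Assemble the spark-submit argument list for a Kerberized cluster-mode run."""
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--files", hive_site,
        "--keytab", keytab,
        "--principal", principal,
        script,
    ]


def missing_files(paths):
    """Return the paths that do not exist as regular files on this machine."""
    return [p for p in paths if not os.path.isfile(p)]


cmd = build_submit_command(
    script="td_to_hive_test.py",
    keytab="/home/alias/.kt/alias.keytab",
    principal="alias@realm.com.org",
    hive_site="/etc/hive/conf/hive-site.xml",
)
# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

Calling `missing_files(["/home/alias/.kt/alias.keytab", "/etc/hive/conf/hive-site.xml"])` first would show whether the volume mount is in place; the list could then be handed to `subprocess.run(cmd, check=True)`.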