GCP Dataproc - adding multiple packages(kafka, mongodb) while submitting jobs not working
I am trying to add both the kafka and mongoDB packages while submitting a Dataproc PySpark job, but it is failing.
So far I have been using only the kafka package and it has worked fine; however, when I try to also add the mongoDB package in the command below, it gives an error.
The command works fine with only the Kafka package:
gcloud dataproc jobs submit pyspark main.py \
--cluster versa-structured-stream \
--properties spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2, spark.dynamicAllocation.enabled=true,spark.shuffle.service.enabled=true
I have tried several options to add both packages, but this does not work, e.g.:
--properties ^#^spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2,spark:spark.dynamicAllocation.enabled=true,spark:spark.shuffle.service.enabled=true,spark:spark.executor.memory=20g,spark:spark.driver.memory=5g,spark:spark.executor.cores=2 \
--jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.2.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
--region us-east1 \
--py-files streams.zip,utils.zip
Error:
Traceback (most recent call last):
File "/tmp/1abcccefa3144660967606f3f7f9491d/main.py", line 303, in <module>
sys.exit(main())
File "/tmp/1abcccefa3144660967606f3f7f9491d/main.py", line 260, in main
df_stream = spark.readStream.format('kafka') \
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 482, in load
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
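For reference, the streaming read in main.py that raises this error is essentially the following (a simplified sketch; the broker and topic values are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raises "Failed to find data source: kafka" when the
# spark-sql-kafka package is not on the job's classpath.
df_stream = spark.readStream.format('kafka') \
    .option('kafka.bootstrap.servers', '<broker-host>:9092') \
    .option('subscribe', '<topic>') \
    .option('startingOffsets', 'latest') \
    .load()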
How do I do this?
TIA!
In your --properties you have defined ^#^ as the delimiter. To use it correctly, you need to change , to # as the separator between properties; , is only used when defining multiple values within a single key. You also need to remove the spark: prefix on the properties. See the example command below:
gcloud dataproc jobs submit pyspark main.py \
--cluster=cluster-3069 \
--region=us-central1 \
--properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2#spark.dynamicAllocation.enabled=true#spark.shuffle.service.enabled=true#spark.executor.memory=20g#spark.driver.memory=5g#spark.executor.cores=2
When checking the job configuration afterwards, both packages show up under spark.jars.packages along with the other properties.
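Once both connectors are on the job's classpath, the stream can be read from Kafka and written out to MongoDB. A minimal sketch, assuming mongo-spark-connector 3.0.2 (whose sink is batch-oriented, so streaming output goes through foreachBatch) and placeholder URI, database, collection, and bucket names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Kafka source; requires org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
df_stream = spark.readStream.format('kafka') \
    .option('kafka.bootstrap.servers', '<broker-host>:9092') \
    .option('subscribe', '<topic>') \
    .load()

def write_to_mongo(batch_df, batch_id):
    # mongo-spark-connector 3.x exposes a batch sink ('mongo'),
    # so each micro-batch is written from foreachBatch.
    batch_df.write.format('mongo') \
        .option('uri', 'mongodb://<host>:27017') \
        .option('database', '<database>') \
        .option('collection', '<collection>') \
        .mode('append') \
        .save()

query = df_stream.selectExpr('CAST(value AS STRING) AS value') \
    .writeStream \
    .foreachBatch(write_to_mongo) \
    .option('checkpointLocation', 'gs://<bucket>/checkpoints/kafka-to-mongo') \
    .start()

query.awaitTermination()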