GCP Dataproc - adding multiple packages (kafka, mongodb) while submitting jobs not working

I'm trying to add the kafka and mongoDB packages while submitting a Dataproc PySpark job, but it fails. So far I've been using only the kafka package and that works fine; however, when I try to add the mongoDB package as well, the job errors out.

The command runs fine with only the Kafka package:

gcloud dataproc jobs submit pyspark main.py \
  --cluster versa-structured-stream  \
  --properties spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,spark.dynamicAllocation.enabled=true,spark.shuffle.service.enabled=true

I've tried several options to add both packages, but none of them work. For example:

--properties ^#^spark:spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2,spark:spark.dynamicAllocation.enabled=true,spark:spark.shuffle.service.enabled=true,spark:spark.executor.memory=20g,spark:spark.driver.memory=5g,spark:spark.executor.cores=2 \
  --jars=gs://dataproc-spark-jars/spark-avro_2.12-3.1.2.jar,gs://dataproc-spark-jars/isolation-forest_2.4.3_2.12-2.0.8.jar,gs://dataproc-spark-jars/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
  --region us-east1 \
  --py-files streams.zip,utils.zip


Error:
Traceback (most recent call last):
  File "/tmp/1abcccefa3144660967606f3f7f9491d/main.py", line 303, in <module>
    sys.exit(main())
  File "/tmp/1abcccefa3144660967606f3f7f9491d/main.py", line 260, in main
    df_stream = spark.readStream.format('kafka') \
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 482, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".

How do I get this to work?

TIA!

In your --properties you have defined ^#^ as the delimiter, but the properties themselves are still separated with ,. With that escape syntax (see gcloud topic escaping), # becomes the separator between properties, and , is only used to list multiple values inside a single key, such as the two packages under spark.jars.packages. You also need to drop the spark: prefix from the property names; that prefixed form is for cluster properties set at cluster creation time, not for job submission. See the sample command below:

gcloud dataproc jobs submit pyspark main.py \
  --cluster=cluster-3069  \
  --region=us-central1 \
  --properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.2#spark.dynamicAllocation.enabled=true#spark.shuffle.service.enabled=true#spark.executor.memory=20g#spark.driver.memory=5g#spark.executor.cores=2
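
If it is ever unclear which packages actually reached the job, one option (not in the original answer) is to print the merged property from inside main.py before the stream is created:

# assumes 'spark' is the job's SparkSession; prints the packages passed
# via --properties, or 'not set' if the property is missing
print(spark.sparkContext.getConf().get('spark.jars.packages', 'not set'))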

Checking the job configuration afterwards confirms that the properties, including both packages under spark.jars.packages, were applied.
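
For completeness, here is a rough sketch (not from the original post) of how main.py might use both packages once they resolve; the broker, topic, MongoDB URI, database, collection, checkpoint path and function name are placeholders. The 3.0.x MongoDB connector does not ship a structured-streaming sink (that arrived with the 10.x connector), so the usual pattern is to write each micro-batch through foreachBatch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Kafka source - served by org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
df_stream = spark.readStream.format('kafka') \
    .option('kafka.bootstrap.servers', 'broker-1:9092') \
    .option('subscribe', 'input-topic') \
    .load() \
    .selectExpr('CAST(value AS STRING) AS value')

# MongoDB sink - served by org.mongodb.spark:mongo-spark-connector_2.12:3.0.2;
# each micro-batch is written with the connector's batch writer
def write_to_mongo(batch_df, batch_id):
    batch_df.write.format('mongo') \
        .option('uri', 'mongodb://mongo-host:27017') \
        .option('database', 'some_db') \
        .option('collection', 'some_collection') \
        .mode('append') \
        .save()

query = df_stream.writeStream \
    .foreachBatch(write_to_mongo) \
    .option('checkpointLocation', 'gs://some-bucket/checkpoints/mongo-sink') \
    .start()

query.awaitTermination()

With both connectors on the classpath, format('kafka') and format('mongo') both resolve, and the AnalysisException from the question goes away.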