Load Databricks csv library in PySpark

I am trying to load the Databricks csv library (see https://github.com/databricks/spark-csv) on a Spark cluster that I created with Google Dataproc. All of this is done with PySpark.

I start PySpark and enter:

spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 --verbose

But I get this response:

Using properties file: /usr/lib/spark/conf/spark-defaults.conf
Adding default property: spark.executor.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar
Adding default property: spark.history.fs.logDirectory=file:///var/log/spark/events
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.driver.maxResultSize=937m
Adding default property: spark.shuffle.service.enabled=true
Adding default property: spark.yarn.historyServer.address=fb-cluster-1-m:18080
Adding default property: spark.driver.memory=1874m
Adding default property: spark.dynamicAllocation.maxExecutors=100000
Adding default property: spark.scheduler.minRegisteredResourcesRatio=0.0
Adding default property: spark.yarn.am.memory=2176m
Adding default property: spark.driver.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar
Adding default property: spark.master=yarn-client
Adding default property: spark.executor.memory=2176m
Adding default property: spark.eventLog.dir=file:///var/log/spark/events
Adding default property: spark.dynamicAllocation.enabled=true
Adding default property: spark.executor.cores=1
Adding default property: spark.yarn.executor.memoryOverhead=384
Adding default property: spark.dynamicAllocation.minExecutors=1
Adding default property: spark.dynamicAllocation.initialExecutors=100000
Adding default property: spark.akka.frameSize=512
Error: Must specify a primary resource (JAR or Python or R file)
Run with --help for usage help or --verbose for debug output

This seems to contradict the documentation at https://github.com/databricks/spark-csv and lebigot's post at https://github.com/databricks/spark-csv/issues/59.

Can someone help me?

It looks like you are trying to run a spark-submit command from within the pyspark shell. It is important to note that the spark-submit command is used for configuring and launching bundled applications on a cluster, whereas the spark-shell and pyspark commands create a shell environment with a pre-instantiated SparkContext so you can run Spark commands interactively. The command-line usage of the shells is very similar to spark-submit, so in your case you would launch the shell as follows if you want to include the spark-csv package:

pyspark --packages com.databricks:spark-csv_2.11:1.2.0 
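
Once the shell starts with the package on the classpath, the CSV data source should be usable directly. Below is a minimal sketch, assuming the pre-instantiated sqlContext of the pyspark shell; the gs:// path is a placeholder rather than a file from the original question:

# Minimal sketch: read a CSV through the spark-csv data source loaded via --packages.
# The path below is a placeholder; substitute your own file.
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("gs://your-bucket/your-file.csv")
df.printSchema()
df.show(5)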

To answer the other question from your comment: the input to the --packages flag is a list of Maven coordinates that map to packages to be searched for and added to the driver/executor classpaths before the job starts. By default, the repositories searched are your local Maven repository and Maven Central (along with any others defined under the --repositories flag). If you do not already have the package in your local Maven repository, it will be downloaded from Maven Central and then served locally whenever the jar is needed again.
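
If you do want to use spark-submit instead, note from the error above that it requires a primary resource (a JAR or a Python/R file). A sketch of such an invocation, where my_job.py and the extra repository URL are placeholders rather than anything from the original post:

spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 \
  --repositories https://repo.example.com/maven2 \
  my_job.py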