Understanding the jars in pyspark

Question

I am new to Spark, and my understanding is this:

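jars are like a bundle of Java code files.
Each library I install that internally uses spark (or pyspark) has its own jar files, which need to be available to both the driver and the executors so that they can execute the package API calls the user interacts with. These jar files are like the backend code for those API calls.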

Questions:

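Why are these jar files needed? Why could it not have sufficed to have all the code in python? (I guess the answer is that Spark was originally written in Scala, where it distributes its dependencies as jars. So, to avoid rebuilding that mountain of a codebase, the python libraries just call the Java code from the python interpreter through some converter that turns the Java code into equivalent python code. Please correct me if I have understood this wrong.)
You specify the locations of these jar files via spark.driver.extraClassPath and spark.executor.extraClassPath when creating the spark context. I guess these are outdated parameters, though. What is the recent way to specify these jar files location?
Where do I find these jars for each library that I install? For example synapseml. What is the general idea about where the jar files for a package are located? Why don't the libraries make it clear where their specific jar files are going to be?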

I understand I might not be making sense here; the parts I mentioned above are just my hunch about how this must work.

So, can you please help me understand this whole business with jars, and how to find and specify them?

Answer 1

Each library that I install that internally uses spark (or pyspark) has its own jar files

Can you tell me which library you are trying to install?

Yes, external libraries can have jars even if you are writing your code in python.

Why?

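These libraries must be using some UDFs (User Defined Functions). Spark runs its code in the Java runtime. If these UDFs were written in python, a lot of time would go into serialization and deserialization, because the data has to be converted into something python can read and then back again.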

Java and Scala UDFs are usually faster, which is why some libraries ship with jars.

Why could it not have sufficed to have all the code in python?

Same reason: Scala/Java UDFs are faster than python UDFs.

What is the recent way to specify these jar files location?

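You can use the spark.jars.packages property. It takes a comma-separated list of Maven coordinates (groupId:artifactId:version), and the resolved jars are made available to both the driver and the executors.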
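If you need more than one package, the property value is just the coordinates joined with commas. Here is a minimal sketch of building that string; the helper name `jars_packages_value` and the second coordinate are made up for illustration and are not part of Spark or any real library.

```python
import re

# Maven coordinates have the form groupId:artifactId:version.
_COORDINATE = re.compile(r"^[^:\s,]+:[^:\s,]+:[^:\s,]+$")


def jars_packages_value(coordinates):
    """Join Maven coordinates into one comma-separated string.

    Hypothetical helper, not a Spark API: it only checks the shape of each
    coordinate and joins them the way spark.jars.packages expects.
    """
    cleaned = [c.strip() for c in coordinates]
    for c in cleaned:
        if not _COORDINATE.match(c):
            raise ValueError(f"not a groupId:artifactId:version coordinate: {c!r}")
    return ",".join(cleaned)


packages = jars_packages_value([
    "com.microsoft.azure:synapseml_2.12:0.9.4",  # from the SynapseML README
    "org.example:hypothetical-lib_2.12:1.0.0",   # hypothetical, illustration only
])
print(packages)
```

The resulting string would then go into `.config("spark.jars.packages", packages)` before the session is created. For jar files you already have on disk, the related `spark.jars` property takes a comma-separated list of paths instead of Maven coordinates.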

Where do I find these jars for each library that I install? For example synapseml. What is the general idea about where the jar files for a package are located?

https://github.com/microsoft/SynapseML#python

They mention there which jar is required, i.e. com.microsoft.azure:synapseml_2.12:0.9.4

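import pyspark
# spark.jars.packages: Maven coordinates of the jars to fetch for driver and executors
# spark.jars.repositories: extra repository to search for those coordinates
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
            .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.4") \
            .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
            .getOrCreate()
import synapse.ml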

Can you try the above snippet?
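As for where the jars end up on disk, here is a rough sketch for looking around. It rests on two assumptions that may not hold on your setup: that packages resolved through spark.jars.packages are cached under `~/.ivy2/jars` (the default Ivy location in many Spark versions, changeable through `spark.jars.ivy`), and that a pip-installed pyspark keeps Spark's own jars in a `jars` folder inside the package. `list_jars` is a hypothetical helper, not a Spark function.

```python
from pathlib import Path


def list_jars(directory):
    """Return the sorted names of the .jar files directly inside `directory`."""
    directory = Path(directory).expanduser()
    if not directory.is_dir():
        return []  # location does not exist on this machine
    return sorted(p.name for p in directory.glob("*.jar"))


# Assumed default Ivy cache location; adjust if spark.jars.ivy is set.
candidate_dirs = [Path("~/.ivy2/jars")]

try:
    import pyspark
    # Assumed layout of a pip-installed pyspark: <site-packages>/pyspark/jars
    candidate_dirs.append(Path(pyspark.__file__).parent / "jars")
except ImportError:
    pass  # pyspark is not installed in this interpreter

for d in candidate_dirs:
    jars = list_jars(d)
    print(f"{d}: {len(jars)} jar(s)")
```

If a directory shows zero jars, that only means nothing was found at the assumed path, not that the package is missing; check your Spark configuration for the actual location.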

apache-spark

pyspark

spark-koalas