在 pyspark 中使用 jdbc jar

Question

我需要在 pyspark 中读取 postgres sql 数据库。我知道之前有人问过这个问题，例如 here, here 和许多其他地方，但是，那里的解决方案要么使用本地运行目录中的 jar，要么手动将其复制到所有工作人员。

我下载了 postgresql-9.4.1208 jar 并将其放在 /tmp/jars 中。然后我继续使用 --jars 和 --driver-class-path 开关调用 pyspark:

pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar --driver-class-path /tmp/jars/postgresql-9.4.1208.jar

我在 pyspark 中做了：

df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name").load()
df.count()

然而，虽然使用 --jars 和 --driver-class-path 对我创建的 jars 工作正常，但对 jdbc 它失败了，我从工作人员那里得到了一个例外：

 java.lang.IllegalStateException: Did not find registered driver with class org.postgresql.Driver

如果我手动将 jar 复制到所有工作人员并添加 --conf spark.executor.extraClassPath 和 --conf spark.driver.extraClassPath，它确实有效（使用相同的 jar）。 documentation 顺便说一句建议使用 SPARK_CLASSPATH ，它实际上添加了这两个开关（但有防止使用我需要做的 --jars 选项添加其他罐子的副作用）

所以我的问题是：jdbc 驱动程序有什么特别之处导致它无法工作，我如何添加它而不必手动将其复制给所有工作人员。

更新：

我做了更多的查找，并在文档中找到了这个： "The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs."。

问题是我似乎找不到 computer_classpath.sh 也不明白原始 class 加载程序的含义。

我确实找到了 this，这基本上说明了这需要在本地完成。我还发现 this 基本上说有一个修复程序，但它在 1.6.1 版本中尚不可用。

Answer 1

您正在查看哪个版本的文档？似乎 compute-classpath.sh 被弃用了一段时间 - 从 Spark 1.3.1 开始：

$ unzip -l spark-1.3.1.zip | egrep '\.sh' | egrep classpa
 6592  2015-04-11 00:04   spark-1.3.1/bin/compute-classpath.sh

$ unzip -l spark-1.4.0.zip | egrep '\.sh' | egrep classpa

什么都不产生。

我认为你应该使用 load-spark-env.sh 来设置你的类路径：

$/opt/spark-1.6.0-bin-hadoop2.6/bin/load-spark-env.sh

并且您需要在 $SPARK_HOME/conf/spark-env.sh 文件中设置 SPARK_CLASSPATH（您将从模板文件 $SPARK_HOME/conf/spark-env.sh.template 复制）。

Answer 2

我找到了一个有效的解决方案（不知道它是否是最好的，请随时继续发表评论）。显然，如果我添加选项：driver="org.postgresql.Driver"，这将正常工作。即我的全行（在 pyspark 内）是：

df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name",driver="org.postgresql.Driver").load()
df.count()

另一件事：如果您已经在使用自己的 fat jar（我在我的完整应用程序中），那么您需要做的就是将 jdbc 驱动程序添加到您的 pom 文件中：

    <dependency>
      <groupId>org.postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>9.4.1208</version>
    </dependency>

然后您不必将驱动程序添加为单独的 jar，只需使用具有依赖项的 jar。

Answer 3

我认为这是此处描述和修复的问题的另一种表现形式：https://github.com/apache/spark/pull/12000。我在 3 周前编写了该修复程序，但没有任何进展。也许如果其他人也表达他们受到它影响的事实，它可能会有所帮助？

在 pyspark 中使用 jdbc jar

Working with jdbc jar in pyspark

postgresql

jdbc

apache-spark

pyspark

pyspark-sql