Installation of graphframes package in an offline Spark cluster

I have an offline pyspark cluster (no internet access) in which I need to install the graphframes library.

I have manually downloaded the jar from here, added it to $SPARK_HOME/jars/, and when I try to use it I get the following error:

error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access term typesafe in package com,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.
error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access term scalalogging in value com.typesafe,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.typesafe.
error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access type LazyLogging in value com.slf4j,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.slf4j.

What is the correct way to install all the dependencies offline?

I managed to install the graphframes library. First of all, I found the graphframes dependencies, which were:

scala-logging-api_xx-xx.jar
scala-logging-slf4j_xx-xx.jar

where xx is the proper Scala and jar version. Then I installed them in the right path. Because I am working on a Cloudera machine, the right path is:

/opt/cloudera/parcels/SPARK2/lib/spark2/jars/

If you cannot place them in this directory on the cluster (because you have no root rights and your admin is super lazy), you can simply add them to your spark-submit / spark-shell call:

spark-submit ..... --driver-class-path /path-for-jar/  \
                   --jars /../graphframes-0.5.0-spark2.1-s_2.11.jar,/../scala-logging-slf4j_2.10-2.1.2.jar,/../scala-logging-api_2.10-2.1.2.jar
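
If you would rather keep the submit command short, the same jars can also be listed once in `spark-defaults.conf` instead of on every call (a sketch, assuming Spark 2.x and the hypothetical `/path-for-jar/` location; `spark.jars` and `spark.driver.extraClassPath` are standard Spark properties):

```
spark.jars                   /path-for-jar/graphframes-0.5.0-spark2.1-s_2.11.jar,/path-for-jar/scala-logging-slf4j_2.10-2.1.2.jar,/path-for-jar/scala-logging-api_2.10-2.1.2.jar
spark.driver.extraClassPath  /path-for-jar/
```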

This works for Scala. In order to use graphframes with Python, you need to download the graphframes jar and then, through a shell:

# Extract JAR content
jar xf graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar
# Enter the folder
cd graphframes
# Zip the contents
zip graphframes.zip -r *
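
The same extract-and-rezip step can be done with Python's `zipfile` module, which is handy on a machine that has no `jar` tool on the PATH (a jar file is just a zip archive). A minimal sketch, with a hypothetical helper name and jar path:

```python
import os
import zipfile


def repack_python_sources(jar_path, subdir, out_zip):
    """Extract `subdir` from the jar, then repack it as a plain zip.

    Mirrors `jar xf ...; cd graphframes; zip graphframes.zip -r *`:
    entries in the output zip are stored relative to `subdir`.
    """
    with zipfile.ZipFile(jar_path) as jar:
        # Extract only the Python package directory, not the whole jar.
        members = [m for m in jar.namelist() if m.startswith(subdir + "/")]
        jar.extractall(members=members)
    with zipfile.ZipFile(out_zip, "w") as z:
        for root, _dirs, files in os.walk(subdir):
            for name in files:
                path = os.path.join(root, name)
                z.write(path, arcname=os.path.relpath(path, subdir))
    return out_zip
```

Usage would be `repack_python_sources("graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar", "graphframes", "graphframes.zip")`.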

Then add the zipped file to your python path in spark-env.sh or your bash_profile, with

export PYTHONPATH=$PYTHONPATH:/..proper path/graphframes.zip:.
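
Putting the zip on PYTHONPATH works because Python's zipimport machinery can import packages directly from a zip archive on `sys.path`. A self-contained sketch of the mechanism, using a hypothetical package name `demo_pkg` in place of graphframes:

```python
import sys
import zipfile

# Build a tiny zip containing a one-module package.
with zipfile.ZipFile("demo_pkg.zip", "w") as z:
    z.writestr("demo_pkg/__init__.py", "VERSION = '0.5.0'\n")

# Same effect as the PYTHONPATH export above.
sys.path.insert(0, "demo_pkg.zip")

import demo_pkg  # loaded from inside the zip, no extraction needed

print(demo_pkg.VERSION)  # → 0.5.0
```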

Then open a shell / submit (again with the same arguments as for Scala) and importing graphframes works normally.

This link was very helpful for this solution.