Python Graphframes: trouble installing dependencies
I'm trying to run a simple GraphFrames example. I have both Python 3.6.8 and Python 2.7.15, along with Apache Maven 3.6.0, Java 1.8.0, Apache Spark 2.4.4, and Scala code runner version 2.11.12.
I get this error:
An error occurred while calling o58.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
I tried to follow this solution, but I got stuck at step 2.
I run pyspark --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
and get the following output:
Python 2.7.15+ (default, Jul 9 2019, 16:51:35)
[GCC 7.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /home/jessica/.ivy2/cache
The jars for the packages stored in: /home/jessica/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-1be543dc-eac1-4324-bef5-4bab70bd9c95;1.0
confs: [default]
downloading file:/home/jessica/.m2/repository/graphframes/graphframes/0.7.0-spark2.4-s_2.11/graphframes-0.7.0-spark2.4-s_2.11.jar ..
[SUCCESSFUL ] graphframes#graphframes;0.7.0-spark2.4-s_2.11!graphframes.jar (18ms)
downloading file:/home/jessica/.m2/repository/org/slf4j/slf4j-api/1.7.16/slf4j-api-1.7.16.jar ...
[SUCCESSFUL ] org.slf4j#slf4j-api;1.7.16!slf4j-api.jar (13ms)
:: resolution report :: resolve 786773ms :: artifacts dl 67ms
:: modules in use:
graphframes#graphframes;0.7.0-spark2.4-s_2.11 from local-m2-cache in [default]
org.slf4j#slf4j-api;1.7.16 from spark-list in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 1 | 1 | 0 || 2 | 2 |
---------------------------------------------------------------------
:: problems summary ::
:::: ERRORS
Server access error at url https://repo1.maven.org/maven2/graphframes/graphframes/0.7.0-spark2.4-s_2.11/graphframes-0.7.0-spark2.4-s_2.11-sources.jar (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.7.0-spark2.4-s_2.11/graphframes-0.7.0-spark2.4-s_2.11-sources.jar (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://repo1.maven.org/maven2/graphframes/graphframes/0.7.0-spark2.4-s_2.11/graphframes-0.7.0-spark2.4-s_2.11-src.jar (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.7.0-spark2.4-s_2.11/graphframes-0.7.0-spark2.4-s_2.11-src.jar (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://repo1.maven.org/maven2/graphframes/graphframes/0.7.0-spark2.4-s_2.11/graphframes-0.7.0-spark2.4-s_2.11-javadoc.jar (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.7.0-spark2.4-s_2.11/graphframes-0.7.0-spark2.4-s_2.11-javadoc.jar (java.net.ConnectException: Connection timed out (Connection timed out))
unknown resolver sbt-chain
unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
:: retrieving :: org.apache.spark#spark-submit-parent-1a173e58-c356-43d7-9112-b06817ef3674
confs: [default]
2 artifacts copied, 0 already retrieved (411kB/27ms)
19/10/25 10:39:01 WARN Utils: Your hostname, jessica-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
19/10/25 10:39:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/10/25 10:39:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Exception in thread "main" java.nio.file.NoSuchFileException: /tmp/tmp6pP3C_/connection6206654157170594455.info
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.createFile(Files.java:632)
at java.nio.file.TempFileHelper.create(TempFileHelper.java:138)
at java.nio.file.TempFileHelper.createTempFile(TempFileHelper.java:161)
at java.nio.file.Files.createTempFile(Files.java:852)
at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:70)
at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Needless to say, this is not the expected output, and all the links that time out lead to a 404. My PC is behind a proxy, but the proxy settings are configured in the Maven settings file, and I know they work.
Are there other proxy settings I need to change? Or is there another way to install these dependencies?
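One caveat worth knowing: the Maven settings file does not apply here, because --packages resolution is handled by Ivy inside the spark-submit JVM, not by Maven. A sketch of passing the standard JVM proxy properties to that JVM instead (host and port below are placeholders):
# proxy.example.com:8080 is a placeholder for the real proxy
pyspark \
  --driver-java-options "-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080" \
  --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11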
Edit
I changed the /usr/share/jupyter/kernels/python3/kernel.json file to:
{
  "argv": [
    "/usr/bin/python3",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "PYSPARK_SUBMIT_ARGS": "--packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 --master local[10] pyspark-shell"
  },
  "display_name": "Python 3",
  "language": "python"
}
and then tried to run my Python script in a Jupyter Notebook. That didn't work. In fact, as soon as I run my Python script, it immediately produces the same error (it crashes right after performing the required imports).
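A variant of the same idea that can help isolate kernel problems is setting PYSPARK_SUBMIT_ARGS from inside the notebook itself, before anything from pyspark is imported. This is only a sketch; findspark is an extra assumption here, not part of the setup described above:
import os

# Must be set before pyspark is imported for the first time
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 "
    "--master local[10] pyspark-shell"
)

import findspark  # assumption: pip install findspark
findspark.init()  # locate the Spark installation and make pyspark importable

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()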
Edit 2
I tweaked my Firefox proxy settings and downloaded the files myself:
-rw-rw-r-- 1 jessica jessica 381110 Oct 22 12:17 graphframes-0.7.0-spark2.4-s_2.11.jar
-rw-rw-r-- 1 jessica jessica 2541 Oct 22 12:14 graphframes-0.7.0-spark2.4-s_2.11.pom
Then I ran mvn install:install-file -Dfile=graphframes-0.7.0-spark2.4-s_2.11.jar -DpomFile=graphframes-0.7.0-spark2.4-s_2.11.pom
and although that process succeeded, I still can't run my script (it fails for the same reason). However, my Maven repository now has a graphframes
folder containing all the required files.
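Since the jar now exists locally, one workaround (a sketch, not verified in this exact setup) is to skip dependency resolution entirely and point Spark at the file. Passing the same jar to --py-files as well exposes the graphframes Python module it bundles:
# same jar twice: --jars for the JVM classpath, --py-files for the Python package;
# path taken from the download log above
pyspark \
  --jars /home/jessica/.m2/repository/graphframes/graphframes/0.7.0-spark2.4-s_2.11/graphframes-0.7.0-spark2.4-s_2.11.jar \
  --py-files /home/jessica/.m2/repository/graphframes/graphframes/0.7.0-spark2.4-s_2.11/graphframes-0.7.0-spark2.4-s_2.11.jar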
Edit 3
I have uninstalled and reinstalled Jupyter, notebook, graphframes, toree, and iPython, and added Anaconda with both Python 2.7 and Python 3. I cannot install the Apache Toree kernel (v0.3.0) for Python/PySpark (I have the SQL and Scala kernels; apparently the Python/PySpark kernel is no longer supported, so a solution for that would also be welcome).
My SPARK_HOME=~/spark/spark-2.2.0-bin-hadoop2.7
variable is also set, along with PYSPARK_DRIVER_PYTHON="jupyter"
and PYSPARK_DRIVER_PYTHON_OPTS="notebook".
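For reference, these are ordinary environment variables; in a bash profile they would look like this sketch (values as listed above):
# values as described above
export SPARK_HOME=~/spark/spark-2.2.0-bin-hadoop2.7
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"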
Solved
Long story short: put the jars directly into $SPARK_HOME/jars.
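For example, since an earlier --packages run already left the jar in the Ivy cache, a sketch (jar location taken from the log above):
# jar location taken from the Ivy log above
cp /home/jessica/.ivy2/jars/graphframes-0.7.0-spark2.4-s_2.11.jar "$SPARK_HOME"/jars/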
I was seeing the same thing you are. The problem is that bintray, the repository that hosted this artifact, shut down in May 2021. When you specify the Maven coordinates for graphframes, you should also provide the repository that currently hosts the artifact.
Launching PySpark as follows worked for me:
pyspark \
--packages graphframes:graphframes:0.8.1-spark2.4-s_2.11 \
--repositories https://repos.spark-packages.org
Further to the answer above:
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .master('yarn')
    .appName('GraphFrames_Test')
    # Pull the graphframes jar by its Maven coordinates...
    .config("spark.jars.packages", "graphframes:graphframes:0.8.1-spark2.4-s_2.11")
    # ...and tell Spark which repository still hosts it.
    .config("spark.jars.repositories", "https://repos.spark-packages.org")
    .getOrCreate()
)
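Once the session is up, a minimal smoke test shows the jar is actually on the classpath (a sketch; the toy vertices and edges are made up, but the id/src/dst column names are what GraphFrames expects):
from graphframes import GraphFrame

v = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
    ["id", "name"],  # vertices need an "id" column
)
e = spark.createDataFrame(
    [("a", "b"), ("b", "c")],
    ["src", "dst"],  # edges need "src" and "dst" columns
)
g = GraphFrame(v, e)
g.inDegrees.show()  # raises the ClassNotFoundException above if the jar is missing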