PySpark exception with GraphFrames
I am building a simple network graph with PySpark and GraphFrames (running on Google Dataproc):
from graphframes import GraphFrame

# Vertex DataFrame: GraphFrames requires an "id" column.
vertices = spark.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
    ("d", "David", 29),
    ("e", "Esther", 32),
    ("f", "Fanny", 36),
    ("g", "Gabby", 60),
], ["id", "name", "age"])

# Edge DataFrame: GraphFrames requires "src" and "dst" columns.
edges = spark.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ("f", "c", "follow"),
    ("e", "f", "follow"),
    ("e", "d", "friend"),
    ("d", "a", "friend"),
    ("a", "e", "friend"),
], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
Then I tried to run label propagation:
result = g.labelPropagation(maxIter=5)
But I get the following error:
Py4JJavaError: An error occurred while calling o164.run.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0 failed 4 times, most recent failure: Lost task 0.3 in stage 19.0 (TID 829, cluster-network-graph-w-12.c.myproject-bi.internal, executor 2): java.lang.ClassNotFoundException: org.graphframes.GraphFrame$$anonfun
It looks like the 'GraphFrame' package is not available, but only when I run label propagation. How can I fix this?
This seems to be a known issue with GraphFrames on Google Dataproc.
Create a Python file containing the following lines, then run it:
from setuptools import setup
setup(
    name='graphframes',
    version='0.5.10',
    packages=['graphframes', 'graphframes.lib'],
)
For details, see:
https://github.com/graphframes/graphframes/issues/238, https://github.com/graphframes/graphframes/issues/172
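I solved it by creating the session with the following parameters, so that the GraphFrames package is placed on both the driver and executor classpaths: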
import pyspark
from pyspark.sql import SparkSession

conf = pyspark.SparkConf().setAll([
    ('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest.jar'),
    ('spark.jars.packages', 'graphframes:graphframes:0.7.0-spark2.3-s_2.11'),
])
spark = SparkSession.builder \
    .appName('testing bq') \
    .config(conf=conf) \
    .getOrCreate()
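The package coordinate above encodes a Spark version (2.3) and a Scala version (2.11), and it presumably has to match what the cluster runs. As a rough sketch, the helper below assembles such a coordinate from version strings. The function name `graphframes_coordinate` is hypothetical, and the naming pattern is only inferred from the single coordinate shown above, so check that the resulting combination is actually published before relying on it.

```python
def graphframes_coordinate(gf_version, spark_version, scala_version):
    """Build a spark.jars.packages coordinate for GraphFrames.

    Assumption: the pattern is copied from the coordinate used above,
    'graphframes:graphframes:0.7.0-spark2.3-s_2.11'. Not every
    version combination necessarily exists as a published package.
    """
    # Keep only major.minor, e.g. "2.3.4" -> "2.3" and "2.11.8" -> "2.11".
    spark_mm = ".".join(spark_version.split(".")[:2])
    scala_mm = ".".join(scala_version.split(".")[:2])
    return "graphframes:graphframes:{}-spark{}-s_{}".format(
        gf_version, spark_mm, scala_mm
    )


# The version strings here are example inputs, not values read from a cluster.
coordinate = graphframes_coordinate("0.7.0", "2.3.4", "2.11.8")
print(coordinate)
```

In a running session, `spark.version` reports the Spark version, which could serve as the second argument.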