Using graphframes with PyCharm
I have spent close to two days scrolling through the internet and I have been unable to sort this one out. I am trying to install the graphframes package (version: 0.2.0-spark2.0-s_2.11) to run through PyCharm, but, despite my best efforts, it has been impossible.
I have tried almost everything. Please know that I had also checked this site here before posting.
Here is the code I am trying to run:
# IMPORT OTHER LIBS --------------------------------------------------------------------------------#
import os
import sys
import pandas as pd

# IMPORT SPARK -------------------------------------------------------------------------------------#
# Path to Spark source folder
USER_FILE_PATH = "/Users/<username>"
SPARK_PATH = "/PycharmProjects/GenesAssociation"
SPARK_FILE = "/spark-2.0.0-bin-hadoop2.7"
SPARK_HOME = USER_FILE_PATH + SPARK_PATH + SPARK_FILE
os.environ['SPARK_HOME'] = SPARK_HOME

# Append pySpark to Python Path
sys.path.append(SPARK_HOME + "/python")
sys.path.append(SPARK_HOME + "/python" + "/lib/py4j-0.10.1-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SQLContext
    from pyspark.graphframes import GraphFrame
except ImportError as ex:
    print "Can not import Spark Modules", ex
    sys.exit(1)

# GLOBAL VARIABLES ---------------------------------------------------------------------------------#
SC = SparkContext('local')
SQL_CONTEXT = SQLContext(SC)

# MAIN CODE ----------------------------------------------------------------------------------------#
if __name__ == "__main__":

    # Main Path to CSV files
    DATA_PATH = '/PycharmProjects/GenesAssociation/data/'
    FILE_NAME = 'gene_gene_associations_50k.csv'

    # LOAD DATA CSV USING PANDAS -------------------------------------------------------------------#
    print "STEP 1: Loading Gene Nodes -------------------------------------------------------------"
    # Read csv file and load as df
    GENES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                        usecols=['OFFICIAL_SYMBOL_A'],
                        low_memory=True,
                        iterator=True,
                        chunksize=1000)
    # Concatenate chunks into list & convert to dataFrame
    GENES_DF = pd.DataFrame(pd.concat(list(GENES), ignore_index=True))
    # Remove duplicates
    GENES_DF_CLEAN = GENES_DF.drop_duplicates(keep='first')
    # Name columns
    GENES_DF_CLEAN.columns = ['gene_id']
    # Output dataFrame
    print GENES_DF_CLEAN
    # Create vertices
    VERTICES = SQL_CONTEXT.createDataFrame(GENES_DF_CLEAN)
    # Show some vertices
    print VERTICES.take(5)

    print "STEP 2: Loading Gene Edges -------------------------------------------------------------"
    # Read csv file and load as df
    EDGES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                        usecols=['OFFICIAL_SYMBOL_A', 'OFFICIAL_SYMBOL_B', 'EXPERIMENTAL_SYSTEM'],
                        low_memory=True,
                        iterator=True,
                        chunksize=1000)
    # Concatenate chunks into list & convert to dataFrame
    EDGES_DF = pd.DataFrame(pd.concat(list(EDGES), ignore_index=True))
    # Name columns
    EDGES_DF.columns = ["src", "dst", "rel_type"]
    # Output dataFrame
    print EDGES_DF
    # Create edges
    EDGES = SQL_CONTEXT.createDataFrame(EDGES_DF)
    # Show some edges
    print EDGES.take(5)

    g = gf.GraphFrame(VERTICES, EDGES)
Needless to say, I have already tried including the graphframes directory in Spark's pyspark directory (see here for what I did). But that does not seem to be enough... anything else I have tried has failed as well. Some help would be much appreciated. You can see the error message I get below:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/09/19 12:46:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/19 12:46:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
STEP 1: Loading Gene Nodes -------------------------------------------------------------
gene_id
0 MAP2K4
1 MYPN
2 ACVR1
3 GATA2
4 RPA2
5 ARF1
6 ARF3
8 XRN1
9 APP
10 APLP1
11 CITED2
12 EP300
13 APOB
14 ARRB2
15 CSF1R
16 PRRC2A
17 LSM1
18 SLC4A1
19 BCL3
20 ADRB1
21 BRCA1
25 ARVCF
26 PCBD1
27 PSEN2
28 CAPN3
29 ITPR1
30 MAGI1
31 RB1
32 TSG101
33 ORC1
... ...
49379 WDR26
49380 WDR5B
49382 NLE1
49383 WDR12
49385 WDR53
49386 WDR59
49387 WDR61
49409 CHD6
49422 DACT1
49424 KMT2B
49438 SMARCA1
49459 DCLRE1A
49469 F2RL1
49472 SENP8
49475 TSPY1
49479 SERPINB5
49521 HOXA11
49548 SYF2
49553 FOXN3
49557 MLANA
49608 REPIN1
49609 GMNN
49670 HIST2H2BE
49767 BCL7C
49797 SIRT3
49810 KLF4
49858 RHO
49896 MAGEA2
49907 SUV420H2
49958 SAP30L
[6025 rows x 1 columns]
16/09/19 12:46:08 WARN TaskSetManager: Stage 0 contains a task of very large size (107 KB). The maximum recommended task size is 100 KB.
[Row(gene_id=u'MAP2K4'), Row(gene_id=u'MYPN'), Row(gene_id=u'ACVR1'), Row(gene_id=u'GATA2'), Row(gene_id=u'RPA2')]
STEP 2: Loading Gene Edges -------------------------------------------------------------
src dst rel_type
0 MAP2K4 FLNC Two-hybrid
1 MYPN ACTN2 Two-hybrid
2 ACVR1 FNTA Two-hybrid
3 GATA2 PML Two-hybrid
4 RPA2 STAT3 Two-hybrid
5 ARF1 GGA3 Two-hybrid
6 ARF3 ARFIP2 Two-hybrid
7 ARF3 ARFIP1 Two-hybrid
8 XRN1 ALDOA Two-hybrid
9 APP APPBP2 Two-hybrid
10 APLP1 DAB1 Two-hybrid
11 CITED2 TFAP2A Two-hybrid
12 EP300 TFAP2A Two-hybrid
13 APOB MTTP Two-hybrid
14 ARRB2 RALGDS Two-hybrid
15 CSF1R GRB2 Two-hybrid
16 PRRC2A GRB2 Two-hybrid
17 LSM1 NARS Two-hybrid
18 SLC4A1 SLC4A1AP Two-hybrid
19 BCL3 BARD1 Two-hybrid
20 ADRB1 GIPC1 Two-hybrid
21 BRCA1 ATF1 Two-hybrid
22 BRCA1 MSH2 Two-hybrid
23 BRCA1 BARD1 Two-hybrid
24 BRCA1 MSH6 Two-hybrid
25 ARVCF CDH15 Two-hybrid
26 PCBD1 CACNA1C Two-hybrid
27 PSEN2 CAPN1 Two-hybrid
28 CAPN3 TTN Two-hybrid
29 ITPR1 CA8 Two-hybrid
... ... ... ...
49969 SAP30 HDAC3 Affinity Capture-Western
49970 BRCA1 RBBP8 Co-localization
49971 BRCA1 BRCA1 Biochemical Activity
49972 SET TREX1 Co-purification
49973 SET TREX1 Reconstituted Complex
49974 PLAGL1 EP300 Reconstituted Complex
49975 PLAGL1 CREBBP Reconstituted Complex
49976 EP300 PLAGL1 Affinity Capture-Western
49977 MTA1 ESR1 Reconstituted Complex
49978 SIRT2 EP300 Affinity Capture-Western
49979 EP300 SIRT2 Affinity Capture-Western
49980 EP300 HDAC1 Affinity Capture-Western
49981 EP300 SIRT2 Biochemical Activity
49982 MIER1 CREBBP Reconstituted Complex
49983 SMARCA4 SIN3A Affinity Capture-Western
49984 SMARCA4 HDAC2 Affinity Capture-Western
49985 ESR1 NCOA6 Affinity Capture-Western
49986 ESR1 TOP2B Affinity Capture-Western
49987 ESR1 PRKDC Affinity Capture-Western
49988 ESR1 PARP1 Affinity Capture-Western
49989 ESR1 XRCC5 Affinity Capture-Western
49990 ESR1 XRCC6 Affinity Capture-Western
49991 PARP1 TOP2B Affinity Capture-Western
49992 PARP1 PRKDC Affinity Capture-Western
49993 PARP1 XRCC5 Affinity Capture-Western
49994 PARP1 XRCC6 Affinity Capture-Western
49995 SIRT3 XRCC6 Affinity Capture-Western
49996 SIRT3 XRCC6 Reconstituted Complex
49997 SIRT3 XRCC6 Biochemical Activity
49998 HDAC1 PAX3 Affinity Capture-Western
[49999 rows x 3 columns]
16/09/19 12:46:11 WARN TaskSetManager: Stage 1 contains a task of very large size (1211 KB). The maximum recommended task size is 100 KB.
[Row(src=u'MAP2K4', dst=u'FLNC', rel_type=u'Two-hybrid'), Row(src=u'MYPN', dst=u'ACTN2', rel_type=u'Two-hybrid'), Row(src=u'ACVR1', dst=u'FNTA', rel_type=u'Two-hybrid'), Row(src=u'GATA2', dst=u'PML', rel_type=u'Two-hybrid'), Row(src=u'RPA2', dst=u'STAT3', rel_type=u'Two-hybrid')]
Traceback (most recent call last):
File "/Users/username/PycharmProjects/GenesAssociation/__init__.py", line 99, in <module>
g = gf.GraphFrame(VERTICES, EDGES)
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 62, in __init__
self._jvm_gf_api = _java_api(self._sc)
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 34, in _java_api
return jsc._jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName) \
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Process finished with exit code 1
Thanks in advance.
You can set PYSPARK_SUBMIT_ARGS in your code:
os.environ["PYSPARK_SUBMIT_ARGS"] = (
"--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
)
spark = SparkSession.builder.getOrCreate()
or edit the run configuration in PyCharm (Run -> Edit Configurations -> select the configuration -> select the Configuration tab -> Environment variables -> add PYSPARK_SUBMIT_ARGS):
With a minimal working example:
import os
import sys

SPARK_HOME = ...

os.environ["SPARK_HOME"] = SPARK_HOME
# os.environ["PYSPARK_SUBMIT_ARGS"] = ...  # if not set in the PyCharm run configuration

sys.path.append(os.path.join(SPARK_HOME, "python"))
sys.path.append(os.path.join(SPARK_HOME, "python/lib/py4j-0.10.3-src.zip"))

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

v = spark.createDataFrame([("a", "foo"), ("b", "bar")], ["id", "attr"])
e = spark.createDataFrame([("a", "b", "foobar")], ["src", "dst", "rel"])

from graphframes import *

g = GraphFrame(v, e)
g.inDegrees.show()
spark.stop()
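If the package is picked up correctly, the inDegrees call should print something like this (only vertex b has an incoming edge in the toy graph above):

+---+--------+
| id|inDegree|
+---+--------+
|  b|       1|
+---+--------+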
You can also add the packages or jars to your spark-defaults.conf.
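For example, a sketch of the relevant line in $SPARK_HOME/conf/spark-defaults.conf (adjust the package coordinates to your Spark and Scala versions):

spark.jars.packages    graphframes:graphframes:0.2.0-spark2.0-s_2.11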
If you use Python 3 with graphframes 0.2, there is a known issue with extracting the Python libraries from the JAR, so you have to do this manually. For example, you can download the JAR file, unzip it, and make sure the root directory containing graphframes is on your Python path, as sketched below. This has been fixed in graphframes 0.3.
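A minimal sketch of that manual extraction, assuming the JAR has already been downloaded locally (JAR_PATH and EXTRACT_DIR below are hypothetical placeholders):

import sys
import zipfile

JAR_PATH = "/path/to/graphframes-0.2.0-spark2.0-s_2.11.jar"  # hypothetical download location
EXTRACT_DIR = "/path/to/graphframes_py"                      # any writable directory

with zipfile.ZipFile(JAR_PATH) as jar:
    # The Spark package JAR ships the Python sources under a top-level graphframes/ directory
    members = [m for m in jar.namelist() if m.startswith("graphframes/")]
    jar.extractall(EXTRACT_DIR, members)

sys.path.append(EXTRACT_DIR)  # now `import graphframes` resolves to the extracted package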