Using graphframes with PyCharm

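I have spent almost 2 days searching the internet and have been unable to solve this problem. I am trying to install the graphframes package (version: 0.2.0-spark2.0-s_2.11) so that I can run it with Spark through PyCharm, but despite my best efforts I have not been able to get it working.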

I have tried almost everything. Please note that I also checked this site here before posting.

Here is the code I am trying to run:

# IMPORT OTHER LIBS --------------------------------------------------------
import os
import sys
import pandas as pd

# IMPORT SPARK ------------------------------------------------------------------------------------#
# Path to Spark source folder
USER_FILE_PATH = "/Users/<username>"
SPARK_PATH = "/PycharmProjects/GenesAssociation"
SPARK_FILE = "/spark-2.0.0-bin-hadoop2.7"
SPARK_HOME = USER_FILE_PATH + SPARK_PATH + SPARK_FILE
os.environ['SPARK_HOME'] = SPARK_HOME

# Append pySpark to Python Path
sys.path.append(SPARK_HOME + "/python")
sys.path.append(SPARK_HOME + "/python" + "/lib/py4j-0.10.1-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SQLContext
    from pyspark.graphframes import graphframe as gf

except ImportError as ex:
    print "Can not import Spark Modules", ex
    sys.exit(1)

# GLOBAL VARIABLES --------------------------------------------------------------------------------#
SC = SparkContext('local')
SQL_CONTEXT = SQLContext(SC)

# MAIN CODE ---------------------------------------------------------------------------------------#
if __name__ == "__main__":

    # Main Path to CSV files
    DATA_PATH = '/PycharmProjects/GenesAssociation/data/'
    FILE_NAME = 'gene_gene_associations_50k.csv'

    # LOAD DATA CSV USING  PANDAS -----------------------------------------------------------------#
    print "STEP 1: Loading Gene Nodes -------------------------------------------------------------"
    # Read csv file and load as df
    GENES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                        usecols=['OFFICIAL_SYMBOL_A'],
                        low_memory=True,
                        iterator=True,
                        chunksize=1000)

    # Concatenate chunks into list & convert to dataFrame
    GENES_DF = pd.DataFrame(pd.concat(list(GENES), ignore_index=True))

    # Remove duplicates
    GENES_DF_CLEAN = GENES_DF.drop_duplicates(keep='first')

    # Name Columns
    GENES_DF_CLEAN.columns = ['gene_id']

    # Output dataFrame
    print GENES_DF_CLEAN

    # Create vertices
    VERTICES = SQL_CONTEXT.createDataFrame(GENES_DF_CLEAN)

    # Show some vertices
    print VERTICES.take(5)

    print "STEP 2: Loading Gene Edges -------------------------------------------------------------"
    # Read csv file and load as df
    EDGES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME,
                        usecols=['OFFICIAL_SYMBOL_A', 'OFFICIAL_SYMBOL_B', 'EXPERIMENTAL_SYSTEM'],
                        low_memory=True,
                        iterator=True,
                        chunksize=1000)

    # Concatenate chunks into list & convert to dataFrame
    EDGES_DF = pd.DataFrame(pd.concat(list(EDGES), ignore_index=True))

    # Name Columns
    EDGES_DF.columns = ["src", "dst", "rel_type"]

    # Output dataFrame
    print EDGES_DF

    # Create edges
    EDGES = SQL_CONTEXT.createDataFrame(EDGES_DF)

    # Show some edges
    print EDGES.take(5)

    g = gf.GraphFrame(VERTICES, EDGES)

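Needless to say, I have tried copying the graphframes directory (see here for what I did) into Spark's pyspark directory, but that does not seem to be enough. Everything else I tried failed as well. I would appreciate some help. The error message I am getting is shown below: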

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/09/19 12:46:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/19 12:46:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

STEP 1: Loading Gene Nodes -------------------------------------------------------------
         gene_id
0         MAP2K4
1           MYPN
2          ACVR1
3          GATA2
4           RPA2
5           ARF1
6           ARF3
8           XRN1
9            APP
10         APLP1
11        CITED2
12         EP300
13          APOB
14         ARRB2
15         CSF1R
16        PRRC2A
17          LSM1
18        SLC4A1
19          BCL3
20         ADRB1
21         BRCA1
25         ARVCF
26         PCBD1
27         PSEN2
28         CAPN3
29         ITPR1
30         MAGI1
31           RB1
32        TSG101
33          ORC1
...          ...
49379      WDR26
49380      WDR5B
49382       NLE1
49383      WDR12
49385      WDR53
49386      WDR59
49387      WDR61
49409       CHD6
49422      DACT1
49424      KMT2B
49438    SMARCA1
49459    DCLRE1A
49469      F2RL1
49472      SENP8
49475      TSPY1
49479   SERPINB5
49521     HOXA11
49548       SYF2
49553      FOXN3
49557      MLANA
49608     REPIN1
49609       GMNN
49670  HIST2H2BE
49767      BCL7C
49797      SIRT3
49810       KLF4
49858        RHO
49896     MAGEA2
49907   SUV420H2
49958     SAP30L

[6025 rows x 1 columns]
16/09/19 12:46:08 WARN TaskSetManager: Stage 0 contains a task of very large size (107 KB). The maximum recommended task size is 100 KB.
[Row(gene_id=u'MAP2K4'), Row(gene_id=u'MYPN'), Row(gene_id=u'ACVR1'), Row(gene_id=u'GATA2'), Row(gene_id=u'RPA2')]
STEP 2: Loading Gene Edges -------------------------------------------------------------
           src       dst                  rel_type
0       MAP2K4      FLNC                Two-hybrid
1         MYPN     ACTN2                Two-hybrid
2        ACVR1      FNTA                Two-hybrid
3        GATA2       PML                Two-hybrid
4         RPA2     STAT3                Two-hybrid
5         ARF1      GGA3                Two-hybrid
6         ARF3    ARFIP2                Two-hybrid
7         ARF3    ARFIP1                Two-hybrid
8         XRN1     ALDOA                Two-hybrid
9          APP    APPBP2                Two-hybrid
10       APLP1      DAB1                Two-hybrid
11      CITED2    TFAP2A                Two-hybrid
12       EP300    TFAP2A                Two-hybrid
13        APOB      MTTP                Two-hybrid
14       ARRB2    RALGDS                Two-hybrid
15       CSF1R      GRB2                Two-hybrid
16      PRRC2A      GRB2                Two-hybrid
17        LSM1      NARS                Two-hybrid
18      SLC4A1  SLC4A1AP                Two-hybrid
19        BCL3     BARD1                Two-hybrid
20       ADRB1     GIPC1                Two-hybrid
21       BRCA1      ATF1                Two-hybrid
22       BRCA1      MSH2                Two-hybrid
23       BRCA1     BARD1                Two-hybrid
24       BRCA1      MSH6                Two-hybrid
25       ARVCF     CDH15                Two-hybrid
26       PCBD1   CACNA1C                Two-hybrid
27       PSEN2     CAPN1                Two-hybrid
28       CAPN3       TTN                Two-hybrid
29       ITPR1       CA8                Two-hybrid
...        ...       ...                       ...
49969    SAP30     HDAC3  Affinity Capture-Western
49970    BRCA1     RBBP8           Co-localization
49971    BRCA1     BRCA1      Biochemical Activity
49972      SET     TREX1           Co-purification
49973      SET     TREX1     Reconstituted Complex
49974   PLAGL1     EP300     Reconstituted Complex
49975   PLAGL1    CREBBP     Reconstituted Complex
49976    EP300    PLAGL1  Affinity Capture-Western
49977     MTA1      ESR1     Reconstituted Complex
49978    SIRT2     EP300  Affinity Capture-Western
49979    EP300     SIRT2  Affinity Capture-Western
49980    EP300     HDAC1  Affinity Capture-Western
49981    EP300     SIRT2      Biochemical Activity
49982    MIER1    CREBBP     Reconstituted Complex
49983  SMARCA4     SIN3A  Affinity Capture-Western
49984  SMARCA4     HDAC2  Affinity Capture-Western
49985     ESR1     NCOA6  Affinity Capture-Western
49986     ESR1     TOP2B  Affinity Capture-Western
49987     ESR1     PRKDC  Affinity Capture-Western
49988     ESR1     PARP1  Affinity Capture-Western
49989     ESR1     XRCC5  Affinity Capture-Western
49990     ESR1     XRCC6  Affinity Capture-Western
49991    PARP1     TOP2B  Affinity Capture-Western
49992    PARP1     PRKDC  Affinity Capture-Western
49993    PARP1     XRCC5  Affinity Capture-Western
49994    PARP1     XRCC6  Affinity Capture-Western
49995    SIRT3     XRCC6  Affinity Capture-Western
49996    SIRT3     XRCC6     Reconstituted Complex
49997    SIRT3     XRCC6      Biochemical Activity
49998    HDAC1      PAX3  Affinity Capture-Western

[49999 rows x 3 columns]
16/09/19 12:46:11 WARN TaskSetManager: Stage 1 contains a task of very large size (1211 KB). The maximum recommended task size is 100 KB.
[Row(src=u'MAP2K4', dst=u'FLNC', rel_type=u'Two-hybrid'), Row(src=u'MYPN', dst=u'ACTN2', rel_type=u'Two-hybrid'), Row(src=u'ACVR1', dst=u'FNTA', rel_type=u'Two-hybrid'), Row(src=u'GATA2', dst=u'PML', rel_type=u'Two-hybrid'), Row(src=u'RPA2', dst=u'STAT3', rel_type=u'Two-hybrid')]
Traceback (most recent call last):
  File "/Users/username/PycharmProjects/GenesAssociation/__init__.py", line 99, in <module>
    g = gf.GraphFrame(VERTICES, EDGES)
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 62, in __init__
    self._jvm_gf_api = _java_api(self._sc)
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 34, in _java_api
    return jsc._jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName) \
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)


Process finished with exit code 1

Thanks in advance.

You can set PYSPARK_SUBMIT_ARGS in your code:

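import os
from pyspark.sql import SparkSession

# Must be set before the first SparkSession/SparkContext is created
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell"
)
spark = SparkSession.builder.getOrCreate()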
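
The variable is read when PySpark launches the JVM, so it has to be in place before any SparkContext is created (in the question's script that means before the module-level `SparkContext('local')` call). As a rough sketch, the value can be built from a list of Maven coordinates. `build_submit_args` is a hypothetical helper name used only for illustration, and `setdefault` is used on the assumption that a value already supplied by the environment should win:

```python
import os


def build_submit_args(packages):
    """Return a PYSPARK_SUBMIT_ARGS value that pulls in the given packages.

    `packages` is a list of Maven coordinates (group:artifact:version).
    The trailing "pyspark-shell" token is kept, as in the snippet above.
    """
    if not packages:
        return "pyspark-shell"
    return "--packages {} pyspark-shell".format(",".join(packages))


# Hypothetical usage: keep any value already provided by the environment.
args = build_submit_args(["graphframes:graphframes:0.2.0-spark2.0-s_2.11"])
os.environ.setdefault("PYSPARK_SUBMIT_ARGS", args)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```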

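Or, in PyCharm, edit the run configuration (Run -> Edit Configurations -> choose the configuration -> select the Configuration tab -> choose Environment variables -> add PYSPARK_SUBMIT_ARGS).

A minimal working example: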

import os
import sys

SPARK_HOME = ...
os.environ["SPARK_HOME"] = SPARK_HOME
# os.environ["PYSPARK_SUBMIT_ARGS"] = ... If not set in PyCharm config

sys.path.append(os.path.join(SPARK_HOME, "python"))
sys.path.append(os.path.join(SPARK_HOME, "python/lib/py4j-0.10.3-src.zip"))

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

v = spark.createDataFrame([("a",  "foo"), ("b", "bar"),], ["id", "attr"])
e = spark.createDataFrame([("a", "b", "foobar")], ["src", "dst", "rel"])


from graphframes import *

g = GraphFrame(v, e)
g.inDegrees.show()

spark.stop()

You can also add the packages / jars to your spark-defaults.conf.
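
For the packages route, the Spark property involved is `spark.jars.packages`, which takes a comma-separated list of Maven coordinates; the file normally lives at `conf/spark-defaults.conf` under `SPARK_HOME`. A small sketch that renders the line to add is shown below. `defaults_line` is a hypothetical helper, and it only prints the entry rather than editing the file:

```python
def defaults_line(packages):
    """Render a spark-defaults.conf entry for the given Maven coordinates."""
    return "spark.jars.packages {}".format(",".join(packages))


print(defaults_line(["graphframes:graphframes:0.2.0-spark2.0-s_2.11"]))
# prints: spark.jars.packages graphframes:graphframes:0.2.0-spark2.0-s_2.11
```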

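If you use Python 3 with graphframes 0.2, there is a known issue with extracting the Python libraries from the JAR, so you will have to do it manually. For example, you can download the JAR file, unzip it, and make sure that the root directory containing graphframes is on your Python path. This has been fixed in graphframes 0.3.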
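
The manual extraction can be sketched with the standard library, since a JAR is an ordinary zip archive. `add_jar_to_python_path` is a hypothetical helper and both paths in the usage comment are placeholders. This only covers the Python side; the JVM classes still have to be loaded through `--packages` or one of the other options described earlier in this answer.

```python
import os
import sys
import zipfile


def add_jar_to_python_path(jar_path, extract_dir):
    """Unzip `jar_path` into `extract_dir` and put that directory on sys.path.

    A JAR is a zip archive, so the standard zipfile module can read it.
    Returns the absolute path that was added.
    """
    with zipfile.ZipFile(jar_path) as jar:
        jar.extractall(extract_dir)
    extract_dir = os.path.abspath(extract_dir)
    if extract_dir not in sys.path:
        sys.path.insert(0, extract_dir)
    return extract_dir


# Hypothetical usage; both paths are placeholders, not values from the post:
# add_jar_to_python_path("/path/to/graphframes.jar", "/path/to/extracted")
# from graphframes import GraphFrame
```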