How to solve the runtime error: graphframes not found

I'm using the graphframes framework with pyspark. It ran normally for a while (I had used the graphframes module before), but after some time it started failing with: "No module named 'graphframes'".

The error is intermittent: sometimes the job completes, sometimes it doesn't.

pyspark version: 2.2.1

graphframes: 0.6

Error:

19/06/05 02:22:17 ERROR Executor: Exception in task 641.3 in stage 216.0 (TID 123244)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/appcom/spark-2.2.1/python/lib/pyspark.zip/pyspark/worker.py", line 166, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/appcom/spark-2.2.1/python/lib/pyspark.zip/pyspark/worker.py", line 55, in read_command
    command = serializer._read_with_length(file)
  File "/appcom/spark-2.2.1/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/appcom/spark-2.2.1/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/data/data08/nm-local-dir/usercache/hduser0011/appcache/application_1547810698423_82435/container_1547810698423_82435_02_000041/ares_detect.zip/ares_detect/task/communication_detect.py", line 11, in <module>
    from graphframes import GraphFrame
ModuleNotFoundError: No module named 'graphframes'

Command:

spark-submit --master yarn-cluster \
        --name ad_com_detect_${app_arr[$i]}_${scenario_arr[$i]}_${txParameter_app_arr[$i]} \
        --executor-cores 4 \
        --num-executors 8 \
        --executor-memory 35g \
        --driver-memory 2g \
        --conf spark.sql.shuffle.partitions=800 \
        --conf spark.default.parallelism=1000 \
        --conf spark.yarn.executor.memoryOverhead=2048 \
        --conf spark.sql.execution.arrow.enabled=true \
        --jars org.scala-lang_scala-reflect-2.10.4.jar,\
org.slf4j_slf4j-api-1.7.7.jar,\
com.typesafe.scala-logging_scala-logging-api_2.10-2.1.2.jar,\
com.typesafe.scala-logging_scala-logging-slf4j_2.10-2.1.2.jar,\
graphframes-0.6.0-spark2.2-s_2.11.jar \
        --py-files ***.zip \
***/***/****.py  &

When pyspark runs low on memory, does it drop these jars?

Try adding the jar via the --packages option. Pick the coordinate that matches your Spark and Scala versions (for Spark 2.2 that would be graphframes:graphframes:0.6.0-spark2.2-s_2.11); the example below uses the Spark 2.4 / Scala 2.11 build:

spark-submit \
    --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11  \
      my_py_script.py

You can also use both options at the same time:

spark-submit \
    --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11  \
    --jars path_to_your_jars/graphframes-0.7.0-spark2.4-s_2.11.jar \
    my_py_script.py

This solved my problem.
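
If the error still shows up intermittently after that, a quick way to narrow it down is to check whether the graphframes package is importable on the executors themselves. A minimal sketch (the probe function and partition count are illustrative, and it assumes a live SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def probe(_):
    # Runs inside an executor's Python worker, so this import sees
    # the same PYTHONPATH the failing job uses.
    import socket
    try:
        import graphframes  # noqa: F401
        return (socket.gethostname(), "ok")
    except ImportError as e:
        return (socket.gethostname(), str(e))

# One task per partition, so the probe lands on many executors.
for host, status in sorted(set(sc.parallelize(range(100), 100).map(probe).collect())):
    print(host, status)

If some hosts report "ok" while others raise the import error, the jar is not reaching every executor, which would explain why only some tasks fail.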

In general, there are four options for adding files to Spark, all documented in spark-submit --help (their programmatic counterparts are sketched after the list):

--jars JARS            Comma-separated list of jars to include on the driver and executor classpaths.

--packages             Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories.

--py-files PY_FILES    Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.

--files FILES          Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName).
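
Each of these also has a programmatic counterpart. A minimal sketch (the paths are placeholders, and the jar/package settings only take effect if they are applied before the SparkContext is created):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # --packages: maven coordinates, resolved from the local maven repo, then Maven Central
    .config("spark.jars.packages", "graphframes:graphframes:0.7.0-spark2.4-s_2.11")
    # --jars: explicit jar paths for the driver and executor classpaths
    .config("spark.jars", "/path/to/graphframes-0.7.0-spark2.4-s_2.11.jar")
    .getOrCreate()
)
sc = spark.sparkContext

# --py-files: ship a .zip/.egg/.py file and add it to the executors' PYTHONPATH
sc.addPyFile("/path/to/my_deps.zip")

# --files: ship an arbitrary file to each executor's working directory;
# tasks resolve its local path with SparkFiles.get
sc.addFile("/path/to/lookup.csv")
print(SparkFiles.get("lookup.csv"))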