SageMaker Studio PySpark example fails

When I try to run the PySpark example that SageMaker provides in SageMaker Studio:

import os

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import sagemaker
from sagemaker import get_execution_role
import sagemaker_pyspark

role = get_execution_role()

# Configure Spark to use the SageMaker Spark dependency jars
jars = sagemaker_pyspark.classpath_jars()

classpath = ":".join(sagemaker_pyspark.classpath_jars())

# See the SageMaker Spark Github repo under sagemaker-pyspark-sdk
# to learn how to connect to a remote EMR cluster running Spark from a Notebook Instance.
spark = SparkSession.builder.config("spark.driver.extraClassPath", classpath)\
    .master("local[*]").getOrCreate()

I get the following exception:

    ---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-6-c8f6fff0daaf> in <module>
     19 # to learn how to connect to a remote EMR cluster running Spark from a Notebook Instance.
     20 spark = SparkSession.builder.config("spark.driver.extraClassPath", classpath)\
---> 21     .master("local[*]").getOrCreate()

/opt/conda/lib/python3.6/site-packages/pyspark/sql/session.py in getOrCreate(self)
    171                     for key, value in self._options.items():
    172                         sparkConf.set(key, value)
--> 173                     sc = SparkContext.getOrCreate(sparkConf)
    174                     # This SparkContext may be an existing one.
    175                     for key, value in self._options.items():

/opt/conda/lib/python3.6/site-packages/pyspark/context.py in getOrCreate(cls, conf)
    361         with SparkContext._lock:
    362             if SparkContext._active_spark_context is None:
--> 363                 SparkContext(conf=conf or SparkConf())
    364             return SparkContext._active_spark_context
    365 

/opt/conda/lib/python3.6/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    127                     " note this option will be removed in Spark 3.0")
    128 
--> 129         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    130         try:
    131             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/opt/conda/lib/python3.6/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    310         with SparkContext._lock:
    311             if not SparkContext._gateway:
--> 312                 SparkContext._gateway = gateway or launch_gateway(conf)
    313                 SparkContext._jvm = SparkContext._gateway.jvm
    314 

/opt/conda/lib/python3.6/site-packages/pyspark/java_gateway.py in launch_gateway(conf)
     44     :return: a JVM gateway
     45     """
---> 46     return _launch_gateway(conf)
     47 
     48 

/opt/conda/lib/python3.6/site-packages/pyspark/java_gateway.py in _launch_gateway(conf, insecure)
    106 
    107             if not os.path.isfile(conn_info_file):
--> 108                 raise Exception("Java gateway process exited before sending its port number")
    109 
    110             with open(conn_info_file, "rb") as info:

Exception: Java gateway process exited before sending its port number

Before running the example, I installed pyspark and sagemaker_pyspark from the notebook. I am also using the SparkMagic kernel from the SageMaker kernel library.

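You may be hitting this issue because this notebook is designed to run when you have an EMR cluster. I suggest launching a notebook on SageMaker with the conda_python3 kernel instead of the SparkMagic kernel. You will need to install pyspark and sagemaker_pyspark with pip, but it should then work with the code you posted.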
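If the error persists after switching kernels, it may be worth checking whether the kernel image has a Java runtime at all. "Java gateway process exited before sending its port number" is commonly reported when PySpark cannot start a local JVM. That is an assumption about your environment, not something the traceback confirms. The helper below is a hypothetical diagnostic (find_java is not part of pyspark or sagemaker_pyspark): it looks for a java executable under JAVA_HOME first and then on PATH, which is roughly how Spark's launch scripts locate it.

```python
import os
import shutil


def find_java(environ=None):
    """Return the path of a java executable, or None if none is found.

    Hypothetical helper for illustration only. It checks JAVA_HOME/bin/java
    first and then falls back to searching PATH.
    """
    environ = os.environ if environ is None else environ

    # Prefer an explicit JAVA_HOME if it points at a real executable file.
    java_home = environ.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate):
            return candidate

    # Otherwise look for "java" on the PATH given in the environment.
    return shutil.which("java", path=environ.get("PATH", ""))


if __name__ == "__main__":
    java_path = find_java()
    if java_path is None:
        print("No java executable found; a local SparkSession cannot start a JVM.")
    else:
        print("Found java at:", java_path)
```

If this reports that no java executable was found, installing a JDK in the kernel image (or choosing an image that ships one) would be the next thing to try before creating the SparkSession with master("local[*]").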

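You can also use the SparkMagic kernel that is available by default in Studio. That kernel includes all the libraries needed to connect to an EMR cluster with sparkmagic and to submit Spark code or run SQL queries.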

See the following blog post for how to use the SparkMagic kernel with EMR: https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-studio-notebooks-backed-by-spark-in-amazon-emr/