Pyspark cluster mode exception - Java gateway process exited before sending the driver its port number
In Apache Airflow I wrote a PythonOperator that uses pyspark to run a job in yarn cluster mode. I initialize the SparkSession object as follows.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("test python operator") \
    .master("yarn") \
    .config("spark.submit.deployMode", "cluster") \
    .getOrCreate()
However, when I run my DAG, I get an exception.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/airflow/models/taskinstance.py", line 983, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.8/dist-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.8/dist-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/catfish/dags/dags_dag_test_python_operator.py", line 39, in print_count
spark = SparkSession \
File "/usr/local/lib/python3.8/dist-packages/pyspark/sql/session.py", line 186, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/usr/local/lib/python3.8/dist-packages/pyspark/context.py", line 371, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/usr/local/lib/python3.8/dist-packages/pyspark/context.py", line 128, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/usr/local/lib/python3.8/dist-packages/pyspark/context.py", line 320, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/usr/local/lib/python3.8/dist-packages/pyspark/java_gateway.py", line 105, in launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
I have also set PYSPARK_SUBMIT_ARGS, but it doesn't work for me!
You need to install Spark on your Ubuntu container.
# Install Java (required for the Spark JVM gateway), Scala and download tooling
RUN apt-get -y install default-jdk scala git curl wget
# Download and unpack Spark, then point SPARK_HOME at it
RUN wget --no-verbose https://downloads.apache.org/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
RUN tar xvf spark-2.4.6-bin-hadoop2.7.tgz
RUN mv spark-2.4.6-bin-hadoop2.7 /opt/spark
ENV SPARK_HOME=/opt/spark
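With SPARK_HOME pointing at the unpacked distribution, a quick sanity check inside the container could look like the minimal sketch below. It assumes pyspark is installed in the same Python environment and uses local mode only, so it verifies that the Java gateway can start without touching YARN.

from pyspark.sql import SparkSession

# A local-mode session only proves that Java and SPARK_HOME are usable;
# it does not exercise the yarn/cluster settings from the question.
spark = SparkSession.builder \
    .appName("spark-home-sanity-check") \
    .master("local[1]") \
    .getOrCreate()

print(spark.version)
spark.stop()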
Unfortunately, you can't run Spark on YARN in cluster deploy mode from a PythonOperator. I suggest you use the SparkSubmitOperator or the BashOperator instead.
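As an illustration, a minimal DAG using the SparkSubmitOperator might look like the sketch below. The dag id, application path and connection id are placeholders, and the import path shown is for Airflow 1.10 (which the traceback suggests); in Airflow 2.x the operator lives in the apache-spark provider package instead.

from datetime import datetime

from airflow import DAG
# Airflow 1.10 import path; in Airflow 2.x use
# airflow.providers.apache.spark.operators.spark_submit instead.
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

with DAG(
    dag_id="spark_submit_example",      # hypothetical dag id
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_pyspark_job",
        # hypothetical path to the PySpark script that builds the SparkSession
        application="/catfish/dags/spark_jobs/print_count.py",
        # Spark connection configured with master=yarn and deploy-mode=cluster
        conn_id="spark_default",
        name="test python operator",
    )

The BashOperator alternative would simply wrap the equivalent spark-submit --master yarn --deploy-mode cluster ... command, which works as long as spark-submit is on the PATH of the Airflow worker.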