pyspark tasks stuck on Airflow and Spark Standalone Cluster with Docker-compose

I have set up Airflow and a Spark standalone cluster with docker-compose. Airflow runs spark-submit tasks in client mode, submitting directly to the Spark master. But when I execute a spark-submit task, it gets stuck.

Spark submit command:

spark-submit --verbose --master spark:7077 --name dummy_sql_spark_job ${AIRFLOW_HOME}/dags/spark/spark_sql.py

What I see in the spark-submit driver logs:

22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220104070012-0011/1 is now EXITED (Command exited with code 1)
22/01/04 07:02:19 INFO StandaloneSchedulerBackend: Executor app-20220104070012-0011/1 removed: Command exited with code 1
22/01/04 07:02:19 INFO BlockManagerMaster: Removal of executor 1 requested
22/01/04 07:02:19 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
22/01/04 07:02:19 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20220104070012-0011/5 on worker-20220104061702-172.27.0.9-38453 (172.27.0.9:38453) with 1 core(s)
22/01/04 07:02:19 INFO StandaloneSchedulerBackend: Granted executor ID app-20220104070012-0011/5 on hostPort 172.27.0.9:38453 with 1 core(s), 1024.0 MiB RAM
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220104070012-0011/5 is now RUNNING
22/01/04 07:02:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:02:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:02:58 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

What I see from one of the Spark workers:

spark-worker-1_1  | 22/01/04 07:02:18 INFO SecurityManager: Changing modify acls groups to:
spark-worker-1_1  | 22/01/04 07:02:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
spark-worker-1_1  | 22/01/04 07:02:19 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=5001" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.27.0.6:5001" "--executor-id" "3" "--hostname" "172.27.0.11" "--cores" "1" "--app-id" "app-20220104070012-0011" "--worker-url" "spark://Worker@172.27.0.11:35093"

Versions:

Airflow image: apache/airflow:2.2.3

Spark driver version: 3.1.2

Spark server version: 3.2.0

Network

All containers (airflow-scheduler, airflow-webserver, spark-master, spark-worker-n) are attached to the same external network.

The Spark driver is installed in the Airflow containers (scheduler, webserver), since the corresponding DAGs and tasks are executed by the airflow-scheduler.
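The networking described above can be sketched as a docker-compose fragment. This is an assumed setup, not the actual compose file from the question; the service names match the question, but the network name, image tags, and remaining services are my guesses:

```yaml
# Hypothetical docker-compose sketch: every service joins the same
# pre-existing external network, so the driver (inside the Airflow
# containers) can reach the master, and the executors can connect
# back to the driver on its spark.driver.port.
version: "3.8"

services:
  airflow-scheduler:
    image: apache/airflow:2.2.3
    networks: [spark-net]

  spark-master:
    image: bitnami/spark:3.2.0
    networks: [spark-net]

  spark-worker-1:
    image: bitnami/spark:3.2.0
    networks: [spark-net]

networks:
  spark-net:
    external: true
```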

UPDATE

After changing the driver's Spark version to match the master's 3.2.0, the problem went away. So, in my particular case, the issue was not connectivity between the different Spark actors (driver, master, worker/executor) but a version mismatch. For some reason the Spark worker did not log a corresponding error, which was misleading.

Most threads on this point to connectivity problems. In my case, however, the problem was a version mismatch between the Spark driver and the master/workers.

After changing the driver's Spark version to match the master's 3.2.0, and making sure the Python version is the same (3.9.10) on both the driver and executor side, the problem disappeared. So, at least in my particular case, the issue was not connectivity between the different Spark actors (driver, master, worker/executor) but a version mismatch. For some reason the Spark worker did not log a corresponding error, which was misleading.
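The root cause can be made mechanical with a small pre-flight check. A minimal sketch (the helper name and the major.minor comparison rule are my assumptions; Spark generally requires driver and cluster to agree on the major.minor version, which is exactly the 3.1 vs 3.2 mismatch here):

```python
def versions_match(driver_version: str, cluster_version: str) -> bool:
    """Return True when driver and cluster agree on major.minor.

    Mixing driver and cluster versions across major.minor releases
    is not supported by Spark; patch-level differences are ignored here.
    """
    return driver_version.split(".")[:2] == cluster_version.split(".")[:2]

# The combination from this question: driver 3.1.2 against a 3.2.0 cluster.
print(versions_match("3.1.2", "3.2.0"))  # False -> incompatible, executors keep dying
print(versions_match("3.2.0", "3.2.0"))  # True  -> compatible
```

On the driver side the installed version can be read from `pyspark.__version__` (or `spark-submit --version`), and compared against the version shown in the master's web UI before submitting.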