Spark History Server is very slow when the driver is running on the master node
I'm running Spark 2.4.5 on AWS EMR 5.30.0 with r5.4xlarge instances (16 vCores, 128 GiB memory, EBS-only storage, 256 GiB EBS storage): 1 master, 1 core, and 30 task nodes.
I started the Spark Thrift Server on the master node; it is the only job running on the cluster:
sudo /usr/lib/spark/sbin/start-thriftserver.sh --conf spark.blacklist.enabled=true --conf spark.blacklist.stage.maxFailedExecutorsPerNode=4 --conf spark.blacklist.task.maxTaskAttemptsPerNode=3 --conf spark.driver.cores=12 --conf spark.driver.maxResultSize=10g --conf spark.driver.memory=86000M --conf spark.driver.memoryOverhead=10240 --conf spark.kryoserializer.buffer.max=768m --conf spark.rpc.askTimeout=700 --conf spark.sql.broadcastTimeout=800 --conf spark.sql.sources.partitionOverwriteMode=dynamic --conf spark.task.maxFailures=20
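Then I submit SQL queries over JDBC, but when a lot of queries are running, the UI becomes very slow. I thought that setting spark.driver.cores=12 (the master node has 16) and spark.driver.memory=86000M (it has 128 GB of memory) would leave the master node enough headroom to run other processes such as the History Server, but it is still slow.
So I assume there are other settings I could change to get the UI working properly, but I'm not sure which ones.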
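To make the headroom reasoning above concrete, here is a minimal sketch of the arithmetic using the values from the command line. The function and constant names are hypothetical, and this is only nominal bookkeeping on the configured values (treating 128 GiB as 131072 MiB); it does not measure what the driver or the History Server actually use.

```python
# Nominal headroom on the master node, from the configured values only.
# All names here are illustrative; the numbers come from the post.
MASTER_VCORES = 16
MASTER_MEMORY_MIB = 128 * 1024      # 128 GiB expressed in MiB

DRIVER_CORES = 12                   # spark.driver.cores
DRIVER_MEMORY_MIB = 86000           # spark.driver.memory=86000M
DRIVER_OVERHEAD_MIB = 10240         # spark.driver.memoryOverhead=10240


def master_headroom():
    """Return (free vCores, free MiB) after subtracting the driver settings."""
    free_cores = MASTER_VCORES - DRIVER_CORES
    free_mib = MASTER_MEMORY_MIB - (DRIVER_MEMORY_MIB + DRIVER_OVERHEAD_MIB)
    return free_cores, free_mib


cores, mib = master_headroom()
print(f"{cores} vCores and {mib} MiB nominally left for other processes")
```

On paper that leaves 4 vCores and roughly 34 GiB, which suggests the slowness is not simply a matter of CPU or memory headroom on the master node.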
These are the settings in the cluster's spark-defaults.conf, for reference:
spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///var/log/spark/apps
spark.history.fs.logDirectory hdfs:///var/log/spark/apps
spark.sql.warehouse.dir hdfs:///user/spark/warehouse
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.yarn.historyServer.address <xxxxx>:18080
spark.history.ui.port 18080
spark.shuffle.service.enabled true
spark.yarn.dist.files /etc/spark/conf/hive-site.xml
spark.driver.extraJavaOptions -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.dynamicAllocation.enabled true
spark.blacklist.decommissioning.enabled true
spark.blacklist.decommissioning.timeout 1h
spark.resourceManager.cleanupExpiredHost true
spark.stage.attempt.ignoreOnDecommissionFetchFailure true
spark.decommissioning.timeout.threshold 20
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.hadoop.yarn.timeline-service.enabled false
spark.yarn.appMasterEnv.SPARK_PUBLIC_DNS $(hostname -f)
spark.files.fetchFailure.unRegisterOutputOnHost true
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem true
spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds 2000
spark.sql.parquet.output.committer.class com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
spark.sql.emr.internal.extensions com.amazonaws.emr.spark.EmrSparkSessionExtensions
spark.sql.sources.partitionOverwriteMode dynamic
spark.executor.instances 1
spark.executor.cores 16
spark.driver.memory 2048M
spark.executor.memory 109498M
spark.default.parallelism 32
spark.emr.maximizeResourceAllocation true
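The problem was that there was only 1 core instance: because the event logs are saved in HDFS, that instance became the bottleneck.
I added another core instance and it is much better now.
Another solution would be to save the logs to S3/S3A instead of HDFS by changing the parameters below in spark-defaults.conf (make sure they are also changed in the UI configuration), although it may require adding some JAR files to work. These are the parameters, shown with their current HDFS values: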
spark.eventLog.dir hdfs:///var/log/spark/apps
spark.history.fs.logDirectory hdfs:///var/log/spark/apps
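As a rough illustration of that change, the sketch below rewrites just those two keys in the text of a spark-defaults.conf. The helper name and the bucket `example-bucket` are hypothetical placeholders, and the sketch assumes whitespace-separated `key value` lines like the ones shown in the question; it has not been tested against a real EMR cluster.

```python
# Hypothetical helper: point the Spark event log settings at a new location.
# "example-bucket" is a placeholder, not a real bucket.
LOG_KEYS = ("spark.eventLog.dir", "spark.history.fs.logDirectory")


def point_event_logs_at(conf_text, log_dir):
    """Return conf_text with both event log keys set to log_dir."""
    rewritten = []
    for line in conf_text.splitlines():
        parts = line.split(None, 1)  # split the key from the rest of the line
        if parts and parts[0] in LOG_KEYS:
            rewritten.append(f"{parts[0]} {log_dir}")
        else:
            rewritten.append(line)   # leave every other setting untouched
    return "\n".join(rewritten)


sample = "\n".join([
    "spark.eventLog.enabled true",
    "spark.eventLog.dir hdfs:///var/log/spark/apps",
    "spark.history.fs.logDirectory hdfs:///var/log/spark/apps",
])
print(point_event_logs_at(sample, "s3a://example-bucket/spark/apps"))
```

Both keys need to point at the same location, since the History Server reads from the directory the applications write their event logs to.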