How to get custom log4j.properties to take effect for Spark driver and executor on AWS EMR cluster?

I have an AWS CLI cluster-creation command that I am trying to modify so that my driver and executors pick up a custom log4j.properties file. On a Spark standalone cluster I have successfully used the approach of shipping the file with the --files switch together with setting -Dlog4j.configuration= via spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.
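For reference, the pattern that worked for me on standalone looked roughly like this (a minimal sketch; the jar path and class name are placeholders, and log4j.properties is assumed to sit in the directory spark-submit is launched from):

# Ship log4j.properties to the driver and to each executor's working
# directory via --files, and point log4j at the local copy with
# -Dlog4j.configuration.
spark-submit \
  --class com.acme.SparkFoo \
  --files log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  spark.jar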

I have tried many different permutations and variations, but I have not yet gotten this working for a Spark job running on an AWS EMR cluster.

I use the AWS CLI 'create cluster' command with an intermediate step that downloads my Spark jar and unzips it to obtain the log4j.properties packaged inside that .jar. I then copy that log4j.properties to my HDFS /tmp folder and try to distribute it via "--files".
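Expanded for readability, that intermediate step (stored in the $log4jConfigExtractCmd variable in the command below) amounts to:

# Pull the application jar from S3, extract the log4j.properties
# bundled inside it, and stage that file in HDFS so --files can
# distribute it.
aws s3 cp s3://com-acme/deployments/spark.jar /tmp/spark.jar
cd /home/hadoop
unzip /tmp/spark.jar log4j.properties
hdfs dfs -put log4j.properties /tmp/log4j.properties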

Note that I have also tried this without HDFS (specifying --files log4j.properties instead of --files hdfs:///tmp/log4j.properties), and that did not work either.

The latest non-working version of this command (using HDFS) is given below. I am wondering whether anyone can share a recipe that actually works. When I run this version, the command output for the driver is:

log4j: Trying to find [log4j.properties] using context classloader sun.misc.Launcher$AppClassLoader@1e67b872.
log4j: Using URL [file:/etc/spark/conf.dist/log4j.properties] for automatic log4j configuration.
log4j: Reading configuration from URL file:/etc/spark/conf.dist/log4j.properties
log4j: Parsing for [root] with value=[WARN,stdout].

From the above I can see that my log4j.properties file is not being picked up (the default is). Besides -Dlog4j.configuration=log4j.properties, I also tried configuring -Dlog4j.configuration=classpath:log4j.properties (which failed as well).
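One more form that may be worth noting (an untested variant in this setup): log4j 1.x first tries to parse the value of -Dlog4j.configuration as a URL and only falls back to a classpath lookup when URL parsing fails, so an explicit file: URL relative to the YARN container's working directory, which is where --files drops the file, behaves differently from the bare name:

--conf "spark.driver.extraJavaOptions=-Dlog4j.debug -Dlog4j.configuration=file:log4j.properties"
--conf "spark.executor.extraJavaOptions=-Dlog4j.debug -Dlog4j.configuration=file:log4j.properties"

(The classpath: prefix, incidentally, is a Spring convention; log4j 1.x does not recognize it as a URL scheme, which would explain why that variant failed.)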

Any guidance would be greatly appreciated!

AWS command

jarPath=s3://com-acme/deployments/spark.jar
class=com.acme.SparkFoo


log4jConfigExtractCmd="aws s3 cp $jarPath /tmp/spark.jar ; cd /home/hadoop ; unzip /tmp/spark.jar log4j.properties ;  hdfs dfs -put log4j.properties /tmp/log4j.properties  "


aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Spark \
--tags 'Project=mouse' \
      'Owner=SwarmAnalytics'\
       'DatadogMonitoring=True'\
       'StreamMonitorRedshift=False'\
       'DeployRedshiftLoader=False'\
       'Environment=dev'\
       'DeploySpark=False'\
       'StreamMonitorS3=False'\
       'Name=CCPASixCore' \
--ec2-attributes '{"KeyName":"mouse-spark-2021","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-07039960","EmrManagedSlaveSecurityGroup":"sg-09c806ca38fd32353","EmrManagedMasterSecurityGroup":"sg-092288bbc8812371a"}' \
--release-label emr-5.27.0 \
--log-uri 's3n://log-foo' \
--steps '[{"Args":["bash","-c","'"$log4jConfigExtractCmd"'"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"downloadSparkJar"},{"Args":["spark-submit","--files","hdfs:///tmp/log4j.properties","--deploy-mode","client","--class","'"$class"'","--driver-memory","24G","--conf","spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=256 -Dlog4j.debug -Dlog4j.configuration=log4j.properties","--conf","spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=256 -Dlog4j.debug -Dlog4j.configuration=log4j.properties","--conf","spark.yarn.executor.memoryOverhead=10g","--conf","spark.yarn.driver.memoryOverhead=10g","'"$jarPath"'"],"Type":"CUSTOM_JAR","ActionOnFailure":"CANCEL_AND_WAIT","Jar":"command-runner.jar","Properties":"","Name":"SparkFoo"}]'\
 --instance-groups '[{"InstanceCount":6,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":2}]},"InstanceGroupType":"CORE","InstanceType":"r5d.4xlarge","Name":"Core - 6"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":4}]},"InstanceGroupType":"MASTER","InstanceType":"m5.2xlarge","Name":"Master - 1"}]' \
--configurations '[{"Classification":"spark-log4j","Properties":{"log4j.logger.org.apache.spark.cluster":"ERROR","log4j.logger.com.foo":"INFO","log4j.logger.org.apache.zookeeper":"ERROR","log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout","log4j.logger.org.apache.spark":"ERROR","log4j.logger.org.apache.hadoop":"ERROR","log4j.appender.stdout":"org.apache.log4j.ConsoleAppender","log4j.logger.io.netty":"ERROR","log4j.logger.org.apache.spark.scheduler.cluster":"ERROR","log4j.rootLogger":"WARN,stdout","log4j.appender.stdout.layout.ConversionPattern":"%d{yyyy-MM-dd HH:mm:ss,SSS} %p/%c{1}:%L - %m%n","log4j.logger.org.apache.spark.streaming.scheduler.JobScheduler":"INFO"}},{"Classification":"hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}},{"Classification":"spark-hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]'\
 --auto-terminate --ebs-root-volume-size 10 --service-role EMR_DefaultRole \
--security-configuration 'CCPA_dev_security_configuration_2' --enable-debugging --name 'SparkFoo' \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region us-east-1 --profile sandbox

Here is how to change the logging. The best approach for AWS/EMR (I found) is not to fiddle with

spark.driver.extraJavaOptions or
spark.executor.extraJavaOptions

Instead, leverage a configuration block that looks like this:

[{"Classification":"spark-log4j","Properties":{"log4j.logger.org.apache.spark.cluster":"ERROR","log4j.logger.com.foo":"INFO","log4j.logger.org.apache.zookeeper":"ERROR","log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout","log4j.logger.org.apache.spark":"ERROR",

Now suppose you want to change all logging done by classes under com.foo and its descendants to TRACE. Change the block above to look like this:

[{"Classification":"spark-log4j","Properties":{"log4j.logger.org.apache.spark.cluster":"ERROR","log4j.logger.com.foo":"TRACE","log4j.logger.org.apache.zookeeper":"ERROR","log4j.appender.stdout.layout":"org.apache.log4j.PatternLayout","log4j.logger.org.apache.spark":"ERROR",