Unable to run spark.sql on AWS Glue Catalog in EMR when using Hudi

Our setup is a standard data lake on AWS, with S3 as the storage layer and the Glue Catalog as the metastore.

We started using Apache Hudi and got it working by following the AWS documentation. The problem is that, when using the configuration and JARs indicated in the documentation, we are unable to run spark.sql against our Glue metastore.

Below is some relevant information.

We are creating the cluster using boto3:

import boto3

# EMR client; credentials and region come from the standard AWS configuration.
emr = boto3.client('emr')

response = emr.run_job_flow(
    Name='hudi_debugging',
    LogUri='s3n://mybucket/elasticmapreduce/',
    ReleaseLabel='emr-5.28.0',
    Applications=[
        {
            'Name': 'Spark'
        },
        {
            'Name': 'Hive'
        },
        {
            'Name': 'Hadoop'
        }
    ],
    Instances={
        'InstanceGroups': [
            {
                'Name': 'Core Instance Group',
                'Market': 'SPOT',
                'InstanceCount': 3,
                'EbsConfiguration': {'EbsBlockDeviceConfigs': [
                    {'VolumeSpecification': {'SizeInGB': 16, 'VolumeType': 'gp2'},
                     'VolumesPerInstance': 1}]},
                'InstanceRole': 'CORE',
                'InstanceType': 'm5.xlarge'
            },
            {
                'Name': 'Master Instance Group',
                'Market': 'ON_DEMAND',
                'InstanceCount': 1,
                'EbsConfiguration': {'EbsBlockDeviceConfigs': [
                    {'VolumeSpecification': {'SizeInGB': 16, 'VolumeType': 'gp2'},
                     'VolumesPerInstance': 1}]},
                'InstanceRole': 'MASTER',
                'InstanceType': 'm5.xlarge'
            }
        ],
        'Ec2KeyName': 'dataengineer',
        'Ec2SubnetId': 'mysubnet',
        'EmrManagedMasterSecurityGroup': 'mysg',
        'EmrManagedSlaveSecurityGroup': 'mysg',
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    Configurations=[
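        # hive-site and spark-hive-site below point both Hive and Spark SQL
        # at the AWS Glue Data Catalog as the external metastore.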
        {
            'Classification': 'hive-site',
            'Properties': {
                'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}
        },
        {
            'Classification': 'spark-hive-site',
            'Properties': {
                'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}
        },
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {
                        "PYSPARK_PYTHON": "/usr/bin/python3",
                        "PYSPARK_DRIVER_PYTHON": "/usr/bin/python3",
                        "PYTHONIOENCODING": "utf8"
                    }
                }
            ]
        },
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.sql.execution.arrow.enabled": "true"
            }
        },
        {
            "Classification": "spark",
            "Properties": {
                "maximizeResourceAllocation": "true"
            }
        }
    ],
    BootstrapActions=[
        {
            'Name': 'Bootstrap action',
            'ScriptBootstrapAction': {
                'Path': 's3://mybucket/bootstrap_emr.sh',
                'Args': []
            }
        },
    ],
    Steps=[
        {
            'Name': 'Setup Hadoop Debugging',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['state-pusher-script']
            }
        }
    ],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    ScaleDownBehavior='TERMINATE_AT_TASK_COMPLETION',
    VisibleToAllUsers=True
)
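A short sketch of how the cluster can then be awaited before connecting (illustrative; this part is not in our original script):

cluster_id = response['JobFlowId']

# Block until the cluster is fully provisioned and running.
emr.get_waiter('cluster_running').wait(ClusterId=cluster_id)
print('Cluster %s is up' % cluster_id)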

We launch the pyspark shell using the example from the documentation referenced above:

pyspark \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.sql.hive.convertMetastoreParquet=false" \
--jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar
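For reference, Hudi writes themselves do work in this shell. A minimal sketch of the kind of write we run, modeled on the AWS documentation (the table name, fields, and S3 path here are illustrative, not our exact job):

from pyspark.sql.functions import lit

# Illustrative Hudi write options; names and paths are placeholders.
hudi_options = {
    'hoodie.table.name': 'my_hudi_table',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'creation_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.write.operation': 'upsert',
}

df = spark.range(5) \
    .withColumn('creation_date', lit('2021-04-09')) \
    .withColumn('last_update_time', lit('2021-04-09 00:00:00'))

# Writing through the Hudi datasource to S3 succeeds; only spark.sql
# against the Glue metastore fails, as shown below.
df.write.format('org.apache.hudi') \
    .options(**hudi_options) \
    .mode('overwrite') \
    .save('s3://mybucket/hudi/my_hudi_table/')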

Then, in the shell, when we run spark.sql("show tables"), we get the following error:

Using Python version 3.7.9 (default, Aug 27 2020 21:59:41)
SparkSession available as 'spark'.
>>> spark.sql("show tables").show()
21/04/09 19:49:21 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o67.sql.
: org.apache.spark.sql.AnalysisException: java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:128)
        at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:238)
        at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:117)
        ...
    ... 44 more


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;'

We also tried submitting it as a step, with both deploy-mode client and cluster, and got a similar result.
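A minimal sketch of the step submission, assuming the same JARs and configuration as the shell invocation above (the S3 path of the script is a placeholder):

emr.add_steps(
    JobFlowId=cluster_id,
    Steps=[{
        'Name': 'spark_sql_test',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--deploy-mode', 'cluster',  # 'client' was also tried
                '--conf', 'spark.serializer=org.apache.spark.serializer.KryoSerializer',
                '--conf', 'spark.sql.hive.convertMetastoreParquet=false',
                '--jars', '/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar',
                's3://mybucket/spark_sql_test.py',
            ],
        },
    }],
)

With both deploy modes, the step fails with the following error: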

Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o66.sql.
: org.apache.spark.sql.AnalysisException: java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:128)
    at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:238)
    at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:117)
    ...
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists.apply(HiveExternalCatalog.scala:239)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
    ... 97 more


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/tmp/spark-cd3c9851-5edd-4c6f-abf8-2f4a0af807e3/spark_sql_test.py", line 7, in <module>
    _ = spark.sql("select * from dl_raw.pagnet_clients limit 10")
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;'

Please open an issue at github.com/apache/hudi/issues to get help from the hudi community.

spark.sql.hive.convertMetastoreParquet needs to be set to true.
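For example, assuming a session built in code rather than launched with the shell flags shown above, the flag could be set like this (a sketch; note that the documentation's pyspark invocation above sets it to false):

from pyspark.sql import SparkSession

# Per this answer: enable convertMetastoreParquet instead of disabling it.
spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.sql.hive.convertMetastoreParquet', 'true') \
    .enableHiveSupport() \
    .getOrCreate()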