Unable to run spark.sql on AWS Glue Catalog in EMR when using Hudi
Our setup is configured so that we have a default data lake on AWS, using S3 as storage and the Glue Catalog as the metastore.
We started using Apache Hudi and got it working by following the AWS documentation. The problem is that, when using the configurations and JARs indicated in the docs, we are unable to run spark.sql against our Glue metastore.
Below is some information.
We create the cluster using boto3:
import boto3

# Client creation is assumed (default credentials/region); only the
# run_job_flow call below is our actual configuration.
emr = boto3.client('emr')

emr.run_job_flow(
    Name='hudi_debugging',
    LogUri='s3n://mybucket/elasticmapreduce/',
    ReleaseLabel='emr-5.28.0',
    Applications=[
        {'Name': 'Spark'},
        {'Name': 'Hive'},
        {'Name': 'Hadoop'}
    ],
    Instances={
        'InstanceGroups': [
            {
                'Name': 'Core Instance Group',
                'Market': 'SPOT',
                'InstanceCount': 3,
                'EbsConfiguration': {'EbsBlockDeviceConfigs': [
                    {'VolumeSpecification': {'SizeInGB': 16, 'VolumeType': 'gp2'},
                     'VolumesPerInstance': 1}]},
                'InstanceRole': 'CORE',
                'InstanceType': 'm5.xlarge'
            },
            {
                'Name': 'Master Instance Group',
                'Market': 'ON_DEMAND',
                'InstanceCount': 1,
                'EbsConfiguration': {'EbsBlockDeviceConfigs': [
                    {'VolumeSpecification': {'SizeInGB': 16, 'VolumeType': 'gp2'},
                     'VolumesPerInstance': 1}]},
                'InstanceRole': 'MASTER',
                'InstanceType': 'm5.xlarge'
            }
        ],
        'Ec2KeyName': 'dataengineer',
        'Ec2SubnetId': 'mysubnet',
        'EmrManagedMasterSecurityGroup': 'mysg',
        'EmrManagedSlaveSecurityGroup': 'mysg',
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    Configurations=[
        # Point both Hive and Spark at the Glue Data Catalog.
        {
            'Classification': 'hive-site',
            'Properties': {
                'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}
        },
        {
            'Classification': 'spark-hive-site',
            'Properties': {
                'hive.metastore.client.factory.class': 'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'}
        },
        {
            'Classification': 'spark-env',
            'Configurations': [
                {
                    'Classification': 'export',
                    'Properties': {
                        'PYSPARK_PYTHON': '/usr/bin/python3',
                        'PYSPARK_DRIVER_PYTHON': '/usr/bin/python3',
                        'PYTHONIOENCODING': 'utf8'
                    }
                }
            ]
        },
        {
            'Classification': 'spark-defaults',
            'Properties': {
                'spark.sql.execution.arrow.enabled': 'true'
            }
        },
        {
            'Classification': 'spark',
            'Properties': {
                'maximizeResourceAllocation': 'true'
            }
        }
    ],
    BootstrapActions=[
        {
            'Name': 'Bootstrap action',
            'ScriptBootstrapAction': {
                'Path': 's3://mybucket/bootstrap_emr.sh',
                'Args': []
            }
        },
    ],
    Steps=[
        {
            'Name': 'Setup Hadoop Debugging',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['state-pusher-script']
            }
        }
    ],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    ScaleDownBehavior='TERMINATE_AT_TASK_COMPLETION',
    VisibleToAllUsers=True
)
We start a pyspark shell using the example from the documentation mentioned above:
pyspark \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.sql.hive.convertMetastoreParquet=false" \
--jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar
Then, inside the shell, when we run spark.sql("show tables") we get the following error:
Using Python version 3.7.9 (default, Aug 27 2020 21:59:41)
SparkSession available as 'spark'.
>>> spark.sql("show tables").show()
21/04/09 19:49:21 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o67.sql.
: org.apache.spark.sql.AnalysisException: java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:128)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:238)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:117)
...
... 44 more
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;'
We also tried submitting it as a step, with both client and cluster deploy-mode, and got similar results; a sketch of the step submission and the resulting traceback follow.
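A minimal sketch of how such a step could be added with boto3; the JobFlowId, step name, and s3:// script location below are hypothetical placeholders, not our exact values:

# Sketch only: submit spark_sql_test.py as an EMR step through command-runner.jar,
# reusing the same Hudi configs and JARs as the pyspark shell above.
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # placeholder cluster id
    Steps=[{
        'Name': 'spark_sql_test',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--deploy-mode', 'cluster',  # we also tried 'client'
                '--conf', 'spark.serializer=org.apache.spark.serializer.KryoSerializer',
                '--conf', 'spark.sql.hive.convertMetastoreParquet=false',
                '--jars', '/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar',
                's3://mybucket/spark_sql_test.py',  # placeholder script location
            ],
        },
    }],
)

The step fails with: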
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o66.sql.
: org.apache.spark.sql.AnalysisException: java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:128)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:238)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:117)
...
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists.apply(HiveExternalCatalog.scala:239)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
... 97 more
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/tmp/spark-cd3c9851-5edd-4c6f-abf8-2f4a0af807e3/spark_sql_test.py", line 7, in <module>
_ = spark.sql("select * from dl_raw.pagnet_clients limit 10")
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;'
Please raise an issue at github.com/apache/hudi/issues to get help from the Hudi community.
spark.sql.hive.convertMetastoreParquet needs to be set to true.
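A minimal sketch of trying that suggestion from the already-running pyspark shell, assuming the flag is accepted as a runtime SQL conf (equivalently, relaunch pyspark with --conf "spark.sql.hive.convertMetastoreParquet=true" instead of false):

# Sketch: flip the flag named above at runtime, then retry the failing query.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.sql("show tables").show()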