Not able to query AWS Glue/Athena views in Databricks Runtime ['java.lang.IllegalArgumentException: Can not create a Path from an empty string;']
Trying to read a view created in AWS Athena (based on a Glue table pointing to Parquet files in S3) using pyspark on a Databricks cluster throws the following error, for no apparent reason:

java.lang.IllegalArgumentException: Can not create a Path from an empty string;

My first assumption was missing access permissions, but that turned out not to be the case.
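For context, the failing read is just an ordinary query against the view from the Databricks notebook; a minimal sketch of the kind of call that raises the exception (the database and view names here are hypothetical):

# Hypothetical names: "mydb" and "my_athena_view" stand in for the real database/view.
# Both forms hit the same exception when the view's metadata was written by Athena/Presto.
df = spark.table("mydb.my_athena_view")
# or equivalently:
df = spark.sql("SELECT * FROM mydb.my_athena_view")
df.show()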
While researching further, I found the following Databricks post on the cause of this issue: https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system
I came up with a Python script to solve this problem. It turns out the exception occurs because Athena and Presto store a view's metadata in a format different from what Databricks Runtime and Spark expect, so the view has to be re-created through Spark.
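You can see the difference by pulling the view's catalog entry directly from Glue with boto3: an Athena-created view stores its definition in ViewOriginalText as a Presto-encoded blob (roughly /* Presto View: <base64...> */) and is flagged with a presto_view parameter, neither of which Spark knows how to interpret. A minimal inspection sketch, with hypothetical database/view/region names:

import boto3

# Hypothetical names/region; adjust to your catalog.
glue = boto3.client("glue", region_name="us-east-1")
tbl = glue.get_table(DatabaseName="mydb", Name="my_athena_view")["Table"]
print(tbl["TableType"])           # VIRTUAL_VIEW for views
print(tbl["ViewOriginalText"])    # Presto-encoded definition for Athena-created views
print(tbl.get("Parameters", {}))  # typically contains {"presto_view": "true", ...}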
Example Python script, with an example invocation:
import boto3
import time


def execute_blocking_athena_query(query: str, athenaOutputPath: str, aws_region: str):
    # Submit the query to Athena and poll its status once per second until it finishes.
    athena = boto3.client("athena", region_name=aws_region)
    res = athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": athenaOutputPath},
    )
    execution_id = res["QueryExecutionId"]
    while True:
        res = athena.get_query_execution(QueryExecutionId=execution_id)
        state = res["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return
        if state in ["FAILED", "CANCELLED"]:
            raise Exception(res["QueryExecution"]["Status"]["StateChangeReason"])
        time.sleep(1)


def create_cross_platform_view(db: str, table: str, query: str, spark_session, athenaOutputPath: str, aws_region: str):
    glue = boto3.client("glue", region_name=aws_region)

    # Drop any existing copy of the view (this call fails if the view does not exist yet in Glue).
    glue.delete_table(DatabaseName=db, Name=table)

    # 1. Create the view through Athena and capture the Presto-style view definition it writes to Glue.
    create_view_sql = f"create view {db}.{table} as {query}"
    execute_blocking_athena_query(create_view_sql, athenaOutputPath, aws_region)
    presto_schema = glue.get_table(DatabaseName=db, Name=table)["Table"][
        "ViewOriginalText"
    ]
    glue.delete_table(DatabaseName=db, Name=table)

    # 2. Re-create the same view through Spark so the Glue entry carries Spark-compatible metadata.
    spark_session.sql(create_view_sql).show()
    spark_view = glue.get_table(DatabaseName=db, Name=table)["Table"]

    # 3. Strip the read-only fields that glue.update_table does not accept in TableInput.
    for key in [
        "DatabaseName",
        "CreateTime",
        "UpdateTime",
        "CreatedBy",
        "IsRegisteredWithLakeFormation",
        "CatalogId",
    ]:
        if key in spark_view:
            del spark_view[key]

    # 4. Graft the Presto view definition back on and mark it as a Presto view,
    #    so Athena/Presto can still resolve it.
    spark_view["ViewOriginalText"] = presto_schema
    spark_view["Parameters"]["presto_view"] = "true"
    spark_view = glue.update_table(DatabaseName=db, TableInput=spark_view)


create_cross_platform_view(
    "<YOUR DB NAME>",
    "<YOUR VIEW NAME>",
    "<YOUR VIEW SQL QUERY>",
    <SPARK_SESSION_OBJECT>,
    "<S3 BUCKET FOR OUTPUT>",
    "<YOUR-ATHENA-SERVICE-AWS-REGION>",
)
Note, once again, that this script keeps your view compatible with Glue/Athena.
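If you want to sanity-check the result, one option (a sketch, reusing the hypothetical names from above) is to run the same SELECT through both engines: via the Spark session on Databricks, and via the execute_blocking_athena_query helper defined earlier (which only waits for the query to succeed rather than fetching results):

# Hypothetical view name, bucket, and region.
spark.sql("SELECT * FROM mydb.my_athena_view LIMIT 10").show()
execute_blocking_athena_query(
    "SELECT * FROM mydb.my_athena_view LIMIT 10",
    "s3://my-athena-query-results/",  # Athena query output location
    "us-east-1",
)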