Not able to query AWS Glue/Athena views in Databricks Runtime ['java.lang.IllegalArgumentException: Can not create a Path from an empty string;']

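I am trying to read a view created in AWS Athena (based on a Glue table that points to Parquet files in S3) using pyspark on a Databricks cluster. The read throws the following error, with no obvious cause: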

java.lang.IllegalArgumentException: Can not create a Path from an empty string;

My first assumption was missing access permissions, but that turned out not to be the case.

While researching further, I found the following Databricks documentation page, which explains the cause of this issue: https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system

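It turns out that this exception occurs because Athena and Presto store a view's metadata in a format different from what Databricks Runtime and Spark expect. You need to recreate the view through Spark. I came up with a Python script to solve this problem.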
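To see the format difference for yourself, you can look at what Athena writes into the Glue table entry. The sketch below is a rough illustration, not part of the original script. It assumes that Athena stores the view definition in ViewOriginalText as base64-encoded JSON wrapped in a "/* Presto View: ... */" comment and sets the presto_view parameter to "true". The helper names (is_presto_view, decode_presto_view) and the sample table are made up for the example; in practice the dict would come from glue.get_table(...)["Table"].

```python
import base64
import json

# Assumed wrapper that Athena/Presto puts around the encoded view definition.
PRESTO_PREFIX = "/* Presto View: "
PRESTO_SUFFIX = " */"


def is_presto_view(table: dict) -> bool:
    """Return True if a Glue table dict looks like an Athena/Presto view."""
    params = table.get("Parameters") or {}
    text = table.get("ViewOriginalText") or ""
    return params.get("presto_view") == "true" and text.startswith(PRESTO_PREFIX)


def decode_presto_view(table: dict) -> dict:
    """Decode the base64 JSON payload stored in ViewOriginalText."""
    text = table["ViewOriginalText"]
    if not (text.startswith(PRESTO_PREFIX) and text.endswith(PRESTO_SUFFIX)):
        raise ValueError("ViewOriginalText is not a Presto-encoded view")
    payload = text[len(PRESTO_PREFIX):-len(PRESTO_SUFFIX)]
    return json.loads(base64.b64decode(payload))


# Synthetic stand-in for glue.get_table(...)["Table"]; the values are invented.
encoded = base64.b64encode(json.dumps({"originalSql": "SELECT 1"}).encode()).decode()
sample_table = {
    "Name": "my_view",
    "TableType": "VIRTUAL_VIEW",
    "Parameters": {"presto_view": "true"},
    "ViewOriginalText": PRESTO_PREFIX + encoded + PRESTO_SUFFIX,
}

print(is_presto_view(sample_table))
print(decode_presto_view(sample_table)["originalSql"])
```

If is_presto_view returns True for your view, Spark is most likely failing on exactly this metadata.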

Python script with an example invocation:

import boto3
import time


def execute_blocking_athena_query(query: str, athenaOutputPath, aws_region):
    """Run an Athena query and block until it finishes; raise if it fails."""
    athena = boto3.client("athena", region_name=aws_region)
    res = athena.start_query_execution(QueryString=query, ResultConfiguration={
        'OutputLocation': athenaOutputPath})
    execution_id = res["QueryExecutionId"]
    # Poll once per second until Athena reports a terminal state.
    while True:
        res = athena.get_query_execution(QueryExecutionId=execution_id)
        status = res["QueryExecution"]["Status"]
        if status["State"] == "SUCCEEDED":
            return
        if status["State"] in ["FAILED", "CANCELLED"]:
            # StateChangeReason may be absent, so fall back to the state name.
            raise Exception(status.get("StateChangeReason", status["State"]))
        time.sleep(1)


def create_cross_platform_view(db: str, table: str, query: str, spark_session, athenaOutputPath, aws_region):
    glue = boto3.client("glue", region_name=aws_region)
    # Drop any existing definition; ignore the error if the view does not exist yet.
    try:
        glue.delete_table(DatabaseName=db, Name=table)
    except glue.exceptions.EntityNotFoundException:
        pass
    create_view_sql = f"create view {db}.{table} as {query}"
    # 1. Create the view through Athena and keep its Presto-encoded definition.
    execute_blocking_athena_query(create_view_sql, athenaOutputPath, aws_region)
    presto_schema = glue.get_table(DatabaseName=db, Name=table)["Table"][
        "ViewOriginalText"
    ]
    glue.delete_table(DatabaseName=db, Name=table)

    # 2. Recreate the view through Spark so that Spark can read its metadata.
    spark_session.sql(create_view_sql)
    spark_view = glue.get_table(DatabaseName=db, Name=table)["Table"]
    # 3. Remove read-only fields that update_table does not accept in TableInput.
    for key in [
        "DatabaseName",
        "CreateTime",
        "UpdateTime",
        "CreatedBy",
        "IsRegisteredWithLakeFormation",
        "CatalogId",
    ]:
        if key in spark_view:
            del spark_view[key]
    # 4. Put the Presto definition back so that Athena can still read the view.
    spark_view["ViewOriginalText"] = presto_schema
    spark_view["Parameters"]["presto_view"] = "true"
    glue.update_table(DatabaseName=db, TableInput=spark_view)


create_cross_platform_view("<YOUR DB NAME>", "<YOUR VIEW NAME>", "<YOUR VIEW SQL QUERY>", <SPARK_SESSION_OBJECT>, "<S3 BUCKET FOR OUTPUT>", "<YOUR-ATHENA-SERVICE-AWS-REGION>")

Note again that this script keeps your view compatible with Glue/Athena.
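One caveat about the key-deletion loop in the script: it removes a fixed list of read-only fields, so update_table may still reject the input if get_table returns additional fields (newer boto3 versions can add some). A possible alternative, sketched below, is to keep only the keys that TableInput accepts. The TABLE_INPUT_KEYS set reflects my understanding of the Glue API and should be checked against the boto3 documentation for your version; to_table_input is a hypothetical helper name, not part of the original script.

```python
# Assumed set of keys accepted by Glue's TableInput structure;
# verify against the boto3 Glue update_table documentation.
TABLE_INPUT_KEYS = {
    "Name",
    "Description",
    "Owner",
    "LastAccessTime",
    "LastAnalyzedTime",
    "Retention",
    "StorageDescriptor",
    "PartitionKeys",
    "ViewOriginalText",
    "ViewExpandedText",
    "TableType",
    "Parameters",
    "TargetTable",
}


def to_table_input(table: dict) -> dict:
    """Keep only the fields of a get_table result that TableInput accepts."""
    return {key: value for key, value in table.items() if key in TABLE_INPUT_KEYS}


# Synthetic stand-in for glue.get_table(...)["Table"]; the values are invented.
sample = {
    "Name": "my_view",
    "DatabaseName": "my_db",
    "CatalogId": "123456789012",
    "VersionId": "3",
    "TableType": "VIRTUAL_VIEW",
    "Parameters": {"presto_view": "true"},
}
print(sorted(to_table_input(sample)))
```

With this helper, the loop could be replaced by spark_view = to_table_input(spark_view) before setting ViewOriginalText and presto_view.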

References: