List all S3 files parsed by AWS Glue from a table using the AWS Python SDK boto3

I tried to find a way through the Glue API docs, but there is no attribute or method related to the functions get_table(**kwargs) or get_tables(**kwargs) that does this.

I imagine something like the following (pseudo)code:

import boto3

client = boto3.client('glue')
paginator = client.get_paginator('get_tables')
for response in paginator.paginate(DatabaseName=db_input_shared):
    for table in response['TableList']:
        files = table["files"]  # NOTE: the keyword "files" is invented
        # Do something else
        ...

As far as I can tell from the docs, each table in response["TableList"] should be a dictionary; yet none of its keys seems to give access to the files stored in it.

The solution to the problem is to use awswrangler.

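The following function checks all AWS Glue tables within a database against a specific list of recently uploaded files. Whenever a file path matches, it yields the associated table dictionary. The yielded tables are therefore the ones that have been updated recently.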

from typing import Iterator, List

import awswrangler
import boto3


def _yield_recently_updated_glue_tables(upload_path_list: List[str],
                                        db_name: str) -> Iterator[dict]:
    """Check which tables have been updated recently.

    Args:
        upload_path_list (List[str]): contains all S3-filepaths of recently uploaded files
        db_name (str): name of the AWS Glue database

    Yields:
        dict: AWS Glue table dictionaries recently updated
    """
    client = boto3.client('glue')
    paginator = client.get_paginator('get_tables')
    for response in paginator.paginate(DatabaseName=db_name):
        for table_dict in response['TableList']:
            table_name = table_dict['Name']
            s3_bucket_path = awswrangler.catalog.get_table_location(
                database=db_name, table=table_name)
            # A set gives fast membership checks for the uploaded paths
            s3_filepaths = set(
                awswrangler.s3.describe_objects(s3_bucket_path).keys())
            # Yield the table as soon as one uploaded file is found under it
            if any(upload_file in s3_filepaths
                   for upload_file in upload_path_list):
                yield table_dict
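
If adding awswrangler as a dependency is not an option, the same listing can probably be done with boto3 alone. The sketch below is only an illustration, not part of the original answer: it assumes the table's S3 location is available under table_dict["StorageDescriptor"]["Location"] in the get_tables response, and it lists the objects under that prefix with the S3 list_objects_v2 paginator. The helper names split_s3_uri and list_table_files are hypothetical, and appending a trailing slash to the prefix is a design choice made here so that a table at ".../sales" does not also pick up files from ".../sales_archive".

```python
from typing import List, Tuple
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(s3_uri)
    return parsed.netloc, parsed.path.lstrip("/")


def list_table_files(table_dict: dict, s3_client) -> List[str]:
    """List the S3 object paths stored under a Glue table's location.

    table_dict is one entry of response['TableList'];
    s3_client is expected to behave like boto3.client('s3').
    """
    # Assumption: the location lives under StorageDescriptor.Location.
    # Tables without a location (e.g. views) yield an empty list.
    location = table_dict.get("StorageDescriptor", {}).get("Location")
    if not location:
        return []

    bucket, prefix = split_s3_uri(location)
    # Design choice: end the prefix with '/' to avoid matching sibling
    # prefixes that merely share the same beginning.
    if prefix and not prefix.endswith("/"):
        prefix += "/"

    files = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no 'Contents' key
        for obj in page.get("Contents", []):
            files.append(f"s3://{bucket}/{obj['Key']}")
    return files


# Usage sketch (needs AWS credentials; db_name is a placeholder):
#
#   import boto3
#   glue = boto3.client("glue")
#   s3 = boto3.client("s3")
#   for response in glue.get_paginator("get_tables").paginate(DatabaseName=db_name):
#       for table_dict in response["TableList"]:
#           print(table_dict["Name"], list_table_files(table_dict, s3))
```

Passing the S3 client in as an argument keeps the helper easy to try out with a stub object before pointing it at a real bucket.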