List all S3 files parsed by AWS Glue from a table using the AWS Python SDK boto3
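I tried to find a way to do this in the Glue API docs, but there are no attributes or methods for it related to the functions get_table(**kwargs) or get_tables(**kwargs).
I imagine something like the following (pseudo)code: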
import boto3

client = boto3.client('glue')
paginator = client.get_paginator('get_tables')
for response in paginator.paginate(DatabaseName=db_input_shared):
    for table in response['TableList']:
        files = table["files"]  # NOTE: the keyword "files" is invented
        # Do something else
        ...
As far as I can tell from the docs, each table in response["TableList"] should be a dictionary; yet none of its keys seem to give access to the files stored in it.
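To make that observation concrete: a table dictionary does record where its data lives, under StorageDescriptor -> Location, but that value is only an S3 prefix, not a list of objects. The following is a minimal sketch of reading it; the helper name table_location and the example_table values are hypothetical and only serve as illustration.

```python
from typing import Optional


def table_location(table: dict) -> Optional[str]:
    """Return the S3 location recorded in a Glue table dictionary, if any.

    Some entries (for example views) may carry no location, hence Optional.
    """
    return table.get("StorageDescriptor", {}).get("Location")


# Hypothetical, heavily trimmed table entry; real entries hold many more keys.
example_table = {
    "Name": "sales",
    "DatabaseName": "my_database",
    "StorageDescriptor": {
        "Location": "s3://my-bucket/data/sales/",
        "Columns": [{"Name": "id", "Type": "bigint"}],
    },
}

print(table_location(example_table))  # prints s3://my-bucket/data/sales/
```

The files themselves still have to be listed from S3 under that prefix, which is the step the table dictionary does not cover.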
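The solution to the problem is to use awswrangler.
The following function checks all AWS Glue tables within a database against a specific list of recently uploaded files. Whenever a filename matches, it yields the associated table dictionary. The yielded tables are the ones that were recently updated.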
from typing import Iterator, List

import awswrangler
import boto3


def _yield_recently_updated_glue_tables(upload_path_list: List[str],
                                        db_name: str) -> Iterator[dict]:
    """Check which tables have been updated recently.

    Args:
        upload_path_list (List[str]): contains all S3-filepaths of recently uploaded files
        db_name (str): name of the AWS Glue database

    Yields:
        dict: AWS Glue table dictionaries recently updated
    """
    client = boto3.client('glue')
    paginator = client.get_paginator('get_tables')
    for response in paginator.paginate(DatabaseName=db_name):
        for table_dict in response['TableList']:
            table_name = table_dict['Name']
            # Resolve the S3 location that the Glue table points at.
            s3_bucket_path = awswrangler.catalog.get_table_location(
                database=db_name, table=table_name)
            # describe_objects returns a dict keyed by full S3 object path.
            s3_filepaths = list(
                awswrangler.s3.describe_objects(s3_bucket_path).keys())
            table_was_updated = False
            for upload_file in upload_path_list:
                if upload_file in s3_filepaths:
                    table_was_updated = True
                    break
            if table_was_updated:
                yield table_dict
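If adding awswrangler is not an option, the listing step can probably be done with boto3 alone by paginating list_objects_v2 over the table's location. This is a rough sketch rather than a drop-in replacement: split_s3_uri and iter_s3_filepaths are hypothetical helper names, only plain s3:// locations are handled, and the S3 client is passed in as an argument instead of being created inside the function.

```python
from typing import Iterator, Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, prefix


def iter_s3_filepaths(location: str, s3_client) -> Iterator[str]:
    """Yield 's3://bucket/key' for every object stored under `location`.

    `s3_client` is assumed to behave like boto3.client('s3').
    """
    bucket, prefix = split_s3_uri(location)
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page when no object matches the prefix.
        for obj in page.get("Contents", []):
            yield f"s3://{bucket}/{obj['Key']}"
```

With real credentials, one would call iter_s3_filepaths(location, boto3.client('s3')), where location is the table's StorageDescriptor location, and compare the yielded paths against upload_path_list as in the function above. Passing the client in keeps the helper easy to exercise with a stub and avoids creating a new client per table.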