JMESPath filtering in s3
I have an object in s3 that looks like this:
{'Key': '1111_redshift_us-east-1_dev-ue1-rs-analytics_useractivitylog_2021-05-01T20:18.gz', 'LastModified': datetime.datetime(2021, 5, 24, 19, 14, 40, tzinfo=tzutc()), 'ETag': '"60377db54e3bbcfe7d569b8ea029cfa3-1"', 'Size': 7, 'StorageClass': 'STANDARD'}
A page from the page_iterator looks like this:
PAGE: {'ResponseMetadata': {'HTTPStatusCode': 200, 'HTTPHeaders': {}, 'RetryAttempts': 0}, 'IsTruncated': False, 'Contents': [{'Key': '1111_redshift_us-east-1_dev-ue1-rs-analytics_connectionlog_2021-05-01T20:18.gz', 'LastModified': datetime.datetime(2021, 5, 24, 19, 14, 40, tzinfo=tzutc()), 'ETag': '"60377db54e3bbcfe7d569b8ea029cfa3-1"', 'Size': 7, 'StorageClass': 'STANDARD'}, {'Key': '1111_redshift_us-east-1_dev-ue1-rs-analytics_notvalidname_2021-05-01T20:18.gz', 'LastModified': datetime.datetime(2021, 5, 24, 19, 14, 40, tzinfo=tzutc()), 'ETag': '"60377db54e3bbcfe7d569b8ea029cfa3-1"', 'Size': 7, 'StorageClass': 'STANDARD'}, {'Key': '1111_redshift_us-east-1_dev-ue1-rs-analytics_useractivitylog_2021-05-01T20:18.gz', 'LastModified': datetime.datetime(2021, 5, 24, 19, 14, 40, tzinfo=tzutc()), 'ETag': '"60377db54e3bbcfe7d569b8ea029cfa3-1"', 'Size': 7, 'StorageClass': 'STANDARD'}, {'Key': '1111_redshift_us-east-1_dev-ue1-rs-analytics_userlog_2021-05-01T20:18.gz', 'LastModified': datetime.datetime(2021, 5, 24, 19, 14, 40, tzinfo=tzutc()),
I'm trying to filter like this:
page_iterator = paginator.paginate(**operation_parameters)
print(f"FILTER: {filter}")
# filtered_iterator = page_iterator.search(filter) if filter else page_iterator
for page in page_iterator:
    print(f"PAGE: {page}")
    for obj in page.get("Contents", []):
        print(f"OBJECT: {obj}")
        yield obj
But I'm not getting any objects back. Is the JMESPath filter I'm passing to search wrong? I'm following these docs.
My filter looks like this:
"Contents[?Key[?contains(@, 'useractivitylog') == `true`]]"
Where am I going wrong?
The docs can be pretty confusing. This is a good reference, though even it can be a little verbose: https://opensourceconnections.com/blog/2015/07/27/advanced-aws-cli-jmespath-query/
There are three filters below; just comment/uncomment the different lines to see how each one shapes the output.
Also note that this traverses the entire bucket, so it can be time-consuming.
import boto3

bucket = 'new-bucket-for-lists'
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket=bucket)
# filtered_iterator = page_iterator.search("Contents[?contains(Key, '.py')]")
# filtered_iterator = page_iterator.search("Contents[?contains(Key, '.py')][Key, LastModified]")
filtered_iterator = page_iterator.search("Contents[?contains(Key, '.py')].LastModified")
for key_data in filtered_iterator:
    print(key_data)
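As for why the original filter returns nothing: Key is a string, and Key[?...] is a filter projection that only applies to lists, so it evaluates to null (which is falsy) for every object. Also, contains(Key, '...') already returns a boolean, so no == `true` comparison is needed. A minimal sketch of the corrected call for the question's keys (the bucket name is a placeholder):

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket='my-bucket')

# apply contains() directly to the Key string inside the filter expression
for obj in page_iterator.search("Contents[?contains(Key, 'useractivitylog')]"):
    print(obj['Key'])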
Timing the different AWS apis and jmespath implementations.
I used a folder/prefix holding roughly 1500 objects and compared retrieving all of them against retrieving a filtered set. Somewhat surprisingly, the list_objects endpoint is quite a bit slower than the list_objects_v2 endpoint.
Using jmespath was only marginally better than walking the pages with a python list comprehension: in the end, both pull all the data and then filter it. Perhaps the difference would be more pronounced on larger directories.
%%timeit
# assumes jmespath has been imported and s3sr is a boto3 s3 resource,
# i.e. s3sr = boto3.resource('s3'), with bucket and prefix defined above
keys_list = []
paginator = s3sr.meta.client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
    # print(page)
    # bucket_object_paths = jmespath.search('Contents[*].Key', page)
    bucket_object_paths = jmespath.search("Contents[?contains(Key, 'straddles')].Key", page)
    keys_list.extend(bucket_object_paths)
len(keys_list)
# 450 ms ± 34.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - 1460 objects
# 368 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - filtered
%%timeit
keys_list = []
paginator = s3sr.meta.client.get_paginator('list_objects_v2')
# use Delimiter to limit the search to that level of the hierarchy
for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
    # keys = [content['Key'] for content in page.get('Contents')]
    keys = [content['Key'] for content in page.get('Contents') if 'straddles' in content['Key']]
    # print('keys in page: ', len(keys))
    keys_list.extend(keys)
len(keys_list)
# 448 ms ± 69.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - 1460 objects
# 398 ms ± 31.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - filtered
%%timeit
client = boto3.client('s3')
paginator = client.get_paginator('list_objects')
page_iterator = paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/')
keys_list = page_iterator.search("Contents[?contains(Key, 'straddles')].Key")
# keys_list = page_iterator.search("Contents[*].Key")
len(list(keys_list))
# 948 ms ± 170 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - 1460 objects
# 885 ms ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
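One last note: all of the JMESPath filtering above happens client-side, after each page has already been downloaded, which is why the filtered timings are only modestly faster than the unfiltered ones. The main server-side narrowing S3 offers for listings is the Prefix parameter, so lead with it whenever the key layout allows. A hedged sketch built from the sample keys in the question (bucket name and prefix are illustrative):

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

# Prefix trims the listing on the server; JMESPath then filters what remains
pages = paginator.paginate(Bucket='my-bucket', Prefix='1111_redshift_us-east-1_')
for key in pages.search("Contents[?contains(Key, 'useractivitylog')].Key"):
    print(key)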