Getting "maxCharsPerRecord: 1,048,576" in AWS S3 SelectObjectContent
I am using S3 Select to fetch records from a JSON file in S3. Everything works for me when I fetch data from a small JSON file, i.e. about 2 MB with roughly 10,000 records.
The following is my query:
innerStart = 1
innerStop = 100
maximumLimit = 100
query = "SELECT * FROM s3object r where r.id > " + str(innerStart) + " and r.id <= " + str(innerStop) + " limit " + str(maximumLimit)
r = s3.select_object_content(
    Bucket=cache,
    Key=key + '.json',
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'JSON': {'Type': 'Lines'}, 'CompressionType': 'NONE'},
    OutputSerialization={'JSON': {}},
)
But when I try to query some records from a large JSON file (i.e. 100 MB with more than 578,496 records), I get the following error. Changing my query to fetch only a single record from the large file did not work for me either. Does S3 Select have any limit on the number of characters it scans?
File "./app/main.py", line 118, in retrieve_from_cache_json
OutputSerialization={'JSON': { File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357,
in _api_call
return self._make_api_call(operation_name, kwargs) File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 676,
in _make_api_call
raise error_class(parsed_response, operation_name) botocore.exceptions.ClientError: An error occurred (OverMaxRecordSize)
when calling the SelectObjectContent operation: The character number
in one record is more than our max threshold, maxCharsPerRecord:
1,048,576
Sample JSON file
{
  "id": 1,
  "hostname": "registry.in.",
  "subtype": "A",
  "value": "5.9.139.185",
  "passive_dns_count": "4",
  "count_total": 11,
  "count": 11
}
{
  "id": 2,
  "hostname": "registry.ctn.in.",
  "subtype": "A",
  "value": "18.195.87.188",
  "passive_dns_count": "2",
  "count_total": 11,
  "count": 11
}
{
  "id": 3,
  "hostname": "registry.in.",
  "subtype": "NS",
  "value": "ns-243.awsdns-30.com.",
  "passive_dns_count": "6",
  "count_total": 11,
  "count": 11
}
...
...
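Note: the error matches S3 Select's documented limit that a single record in the input or result may not exceed 1 MB (maxCharsPerRecord: 1,048,576). If the real file looks like the sample above, with each object pretty-printed across several lines, the JSON 'Lines' input type can end up treating a much larger span of text as one record. One thing worth trying before switching formats is re-serializing the file so that each object sits on exactly one line. A minimal sketch (filenames here are hypothetical, and it assumes the whole file fits in memory):
import json

# Hypothetical local copy of the S3 object; re-upload the output afterwards.
src = 'records_pretty.json'
dst = 'records_lines.json'

decoder = json.JSONDecoder()
with open(src) as f:
    text = f.read()

pos = 0
with open(dst, 'w') as out:
    while pos < len(text):
        # Skip whitespace between pretty-printed objects.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode parses exactly one JSON value starting at pos.
        obj, pos = decoder.raw_decode(text, pos)
        # Write one compact object per line, as {'Type': 'Lines'} expects.
        out.write(json.dumps(obj) + '\n')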
I converted the JSON file to CSV, and CSV select works for me. The following is my query:
innerStop = 100
innerStart = 0
maximumLimit = 100
query = "SELECT * FROM s3Object r WHERE cast(r.\"id\" as float) > " + str(innerStart) + " and cast(r.\"id\" as float) <= " + str(innerStop) + " limit " + str(maximumLimit)
r = s3.select_object_content(
    Bucket=cache,
    Key='filename' + '.csv',
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}, 'CompressionType': 'NONE'},
    OutputSerialization={'CSV': {}},
)
for event in r['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
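The JSON-to-CSV conversion itself isn't shown above; here is a minimal sketch of how the records could be flattened into a CSV with a header row. The field names are taken from the sample records, and the filenames are hypothetical:
import csv
import json

# Field names taken from the sample records; filenames are hypothetical.
fieldnames = ['id', 'hostname', 'subtype', 'value',
              'passive_dns_count', 'count_total', 'count']

with open('records_lines.json') as src, open('filename.csv', 'w', newline='') as dst:
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()  # header row, consumed by FileHeaderInfo: 'Use'
    for line in src:
        if line.strip():
            writer.writerow(json.loads(line))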