How can I execute a query with Amazon Athena without exhausting resources?
I'm trying to execute this query to retrieve some data. The data at s3://my_datalake/my_table/year=2018/month=9/day=7/ is 1.1 TB in size, spread across 10,014 snappy.parquet objects.
SELECT array_join(array_agg(distinct endpoint),',') as endpoints_all, count(endpoint) as count_endpoints
FROM my_datalake.my_table
WHERE year=2018 and month=09 and day=07
and ts between timestamp '2018-09-07 00:00:00' and timestamp '2018-09-07 23:59:59'
and status = '2'
GROUP BY domain, size, device_id, ip
But I get this error:
Query exhausted resources at this scale factor
(Run time: 6 minutes 41 seconds, Data scanned: 153.87GB)
The table is partitioned by YEAR, MONTH, DAY and HOUR. How can I run this query? Can I do it with Amazon Athena, or do I need another tool?
My table's schema is:
CREATE EXTERNAL TABLE `ssp_request_prueba`(
`version` string,
`adunit` string,
`adunit_original` string,
`brand` string,
`country` string,
`device_connection_type` string,
`device_density` string,
`device_height` string,
`device_id` string,
`device_type` string,
`device_width` string,
`domain` string,
`endpoint` string,
`endpoint_version` string,
`external_dfp_id` string,
`id_req` string,
`ip` string,
`lang` string,
`lat` string,
`lon` string,
`model` string,
`ncc` string,
`noc` string,
`non` string,
`os` string,
`osv` string,
`scc` string,
`sim_operator_code` string,
`size` string,
`soc` string,
`son` string,
`source` string,
`ts` timestamp,
`user_agent` string,
`status` string,
`delivery_network` string,
`delivery_time` string,
`delivery_status` string,
`delivery_network_key` string,
`delivery_price` string,
`device_id_original` string,
`tracking_limited` string,
`from_cache` string,
`request_price` string)
PARTITIONED BY (
`year` int,
`month` int,
`day` int,
`hour` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://my_datalake/my_table'
TBLPROPERTIES (
'has_encrypted_data'='false',
'transient_lastDdlTime'='1538747353')
The problem is probably related to the array_join and array_agg functions. I suspect that in this case the memory limit of a node in the Athena service was exceeded; with these two functions combined, Athena may not be able to manage that amount of data.
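One way to reduce the per-node memory pressure is to use the existing `hour` partition and aggregate one hour at a time, merging the hourly results afterwards. A minimal sketch (this assumes the hourly results can be combined downstream; the `ts` filter is dropped because the day partition already restricts the time range):

```sql
-- Aggregate a single hour; repeat for hour = 0 .. 23 and merge the
-- per-hour results outside Athena (or with a follow-up query).
-- Each run scans and aggregates roughly 1/24 of the day's data.
SELECT array_join(array_agg(DISTINCT endpoint), ',') AS endpoints_all,
       count(endpoint) AS count_endpoints
FROM my_datalake.my_table
WHERE year = 2018 AND month = 9 AND day = 7 AND hour = 0
  AND status = '2'
GROUP BY domain, size, device_id, ip
```

If the `array_agg(DISTINCT endpoint)` strings themselves are what exhausts memory (many distinct endpoints per group), replacing it with `count(DISTINCT endpoint)` where the joined list is not strictly needed should shrink the intermediate state considerably.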