AWS Athena ALB logs: getting the max hits per minute for each request URL per day
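I am trying to get, for each day, the maximum hits per minute (throughput) for each request URL from ALB logs. The table is partitioned using partition projection. I am trying to work out a query that returns the maximum hits per minute for all URLs over the past 1-3 years.
The result should look like this (just an example; the timestamp can be in any format):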
| Timestamp | Url | Max Hits Per Min |
|---|---|---|
| 12-29-2019 8:01 AM | url1 | 10720 |
| 12-29-2019 10:35 AM | url2 | 21329 |
| 12-29-2019 10:35 AM | url3 | 37420 |
| 12-30-2019 11:53 AM | url1 | 5898 |
| 12-30-2019 01:30 PM | url2 | 14230 |
| 12-30-2019 05:19 PM | url3 | 20000 |
Table creation query:
CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
type string,
time string,
elb string,
client_ip string,
client_port int,
target_ip string,
target_port int,
request_processing_time double,
target_processing_time double,
response_processing_time double,
elb_status_code string,
target_status_code string,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
request_url string,
request_proto string,
user_agent string,
ssl_cipher string,
ssl_protocol string,
target_group_arn string,
trace_id string,
domain_name string,
chosen_cert_arn string,
matched_rule_priority string,
request_creation_time string,
actions_executed string,
redirect_url string,
lambda_error_reason string,
target_port_list string,
target_status_code_list string,
classification string,
classification_reason string
)
PARTITIONED BY ( `partition_date` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' =
'([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"')
LOCATION 's3://your-alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/'
TBLPROPERTIES ('projection.enabled'='true',
'projection.partition_date.format'='yyyy/MM/dd',
'projection.partition_date.interval'='1',
'projection.partition_date.interval.unit'='DAYS',
'projection.partition_date.range'='2018/01/01,NOW',
'projection.partition_date.type'='date',
'storage.location.template'='s3://your-alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/${partition_date}');
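You can try:
-- mytable, timestamp, and url are placeholders for your own table and columns
with cte as (
    -- hits per url per minute
    select date_trunc('minute', timestamp) as minute, url, count(*) as hits_per_minute
    from mytable
    group by 1, 2
)
-- for each day and url: the busiest minute and its hit count
select max_by(minute, hits_per_minute) as timestamp, url, max(hits_per_minute) as max_hits_per_minute
from cte
group by date_trunc('day', minute), url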
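Explanation:
The common table expression (cte) computes the number of hits per url per minute. The outer query then groups those rows by day and url and, for each group, returns the minute at which the maximum was reached (using the max_by function) together with the maximum hit count.
For details, see the Presto documentation for date_trunc and max_by.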
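Applied to the alb_logs table from the question, the same idea might look like the sketch below. It assumes the time column holds ISO 8601 strings (as ALB normally writes them) so that from_iso8601_timestamp can parse it, and it uses an illustrative date range on partition_date; neither the range nor the alias names come from the original answer.

```sql
-- Sketch only: column names are taken from the alb_logs DDL in the question;
-- the date range and the alias names are illustrative assumptions.
WITH per_minute AS (
    SELECT
        -- time is stored as a string, so parse it before truncating
        date_trunc('minute', from_iso8601_timestamp(time)) AS minute,
        request_url,
        count(*) AS hits_per_minute
    FROM alb_logs
    -- partition_date is a yyyy/MM/dd string, so string order matches date order;
    -- filtering on it should limit how many partitions are scanned
    WHERE partition_date BETWEEN '2019/01/01' AND '2019/12/31'
    GROUP BY 1, 2
)
SELECT
    max_by(minute, hits_per_minute) AS busiest_minute,
    request_url,
    max(hits_per_minute) AS max_hits_per_minute
FROM per_minute
GROUP BY date_trunc('day', minute), request_url
ORDER BY 1, 2
```

Two things to keep in mind with this version: the parsed timestamps are in UTC, so "day" means a UTC day, and request_url includes the query string, so URLs that differ only in their parameters are counted separately unless you normalize them first (for example with url_extract_path).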