AWS Athena ALB logs: getting max hits per minute to the request URL each day

Question

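I am trying to get the maximum hits per minute (throughput) to each request URL per day from ALB logs. I am using partition projection on the table. I am trying to work out a query that returns the max hits per minute for all URLs over the past 1-3 years. The result should look like this (just an example; the timestamp can be in any format):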

Timestamp	Url	Max Hits Per Min
12-29-2019 8:01 AM	url1	10720
12-29-2019 10:35 AM	url2	21329
12-29-2019 10:35 AM	url3	37420
12-30-2019 11:53 AM	url1	5898
12-30-2019 01:30 PM	url2	14230
12-30-2019 05:19 PM	url3	20000

Table creation query:

CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
        type string,
        time string,
        elb string,
        client_ip string,
        client_port int,
        target_ip string,
        target_port int,
        request_processing_time double,
        target_processing_time double,
        response_processing_time double,
        elb_status_code string,
        target_status_code string,
        received_bytes bigint,
        sent_bytes bigint,
        request_verb string,
        request_url string,
        request_proto string,
        user_agent string,
        ssl_cipher string,
        ssl_protocol string,
        target_group_arn string,
        trace_id string,
        domain_name string,
        chosen_cert_arn string,
        matched_rule_priority string,
        request_creation_time string,
        actions_executed string,
        redirect_url string,
        lambda_error_reason string,
        target_port_list string,
        target_status_code_list string,
        classification string,
        classification_reason string
        )
        PARTITIONED BY ( `partition_date` string)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
        WITH SERDEPROPERTIES (
        'serialization.format' = '1',
        'input.regex' = 
    '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"')
        LOCATION 's3://your-alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/'
        TBLPROPERTIES ('projection.enabled'='true', 
        'projection.partition_date.format'='yyyy/MM/dd', 
        'projection.partition_date.interval'='1', 
        'projection.partition_date.interval.unit'='DAYS', 
        'projection.partition_date.range'='2018/01/01,NOW', 
        'projection.partition_date.type'='date', 
        'storage.location.template'='s3://your-alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/${partition_date}')

Answer 1

You can try:

with cte as (
   select date_trunc('minute',timestamp) as minute, url, count(*) as hits_per_minute from mytable
group by 1,2
)
select max_by(minute, hits_per_minute) as timestamp, url, max(hits_per_minute) from cte
group by date_trunc('day', minute), url

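Explanation: the common table expression (CTE) computes the number of hits per minute for each URL. The outer query then groups by day and URL and extracts the minute in which the hit count peaked (using the max_by function) together with that maximum hit count.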
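The query above uses placeholder names (mytable, timestamp, url). A minimal sketch of the same approach against the alb_logs table from the question might look like the following. It assumes that the time column holds ISO 8601 strings with microsecond precision and a trailing Z (so it has to be parsed before date_trunc can be applied), and that filtering on partition_date is wanted to limit the data scanned. The date range and the aliases per_minute, peak_minute, and max_hits_per_minute are illustrative only.

```sql
-- Sketch only: adapt the date range and the time format to your own logs.
WITH per_minute AS (
    SELECT
        -- time is stored as a string, so parse it before truncating to the minute
        date_trunc('minute',
                   parse_datetime(time, 'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')) AS minute,
        request_url,
        count(*) AS hits_per_minute
    FROM alb_logs
    -- partition_date uses the yyyy/MM/dd format declared in TBLPROPERTIES
    WHERE partition_date BETWEEN '2019/12/29' AND '2019/12/30'
    GROUP BY 1, 2
)
SELECT
    max_by(minute, hits_per_minute) AS peak_minute,   -- minute with the most hits
    request_url,
    max(hits_per_minute) AS max_hits_per_minute
FROM per_minute
GROUP BY date_trunc('day', minute), request_url
ORDER BY 1, 2
```

Note that request_url contains the full URL including any query string, so URLs that differ only in their parameters are counted separately here. If two minutes tie for the maximum on a given day, max_by returns one of them without a guaranteed choice.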

See the following documentation:

max_by function: https://prestodb.io/docs/current/functions/aggregate.html#id2
date_trunc function: https://prestodb.io/docs/current/functions/datetime.html

