Amazon Athena not parsing CloudFront logs
I am following the Athena getting started guide and trying to parse my own CloudFront logs. However, the fields are not being parsed.
I used a small test file, as follows:
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type
2016-02-02 07:57:45 LHR5 5001 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /foo 404 - Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Error -tHYQ3YpojqpR8yFHCUg5YW4OC_yw7X0VWvqwsegPwDqDFkIqhZ_gA== d3g47gpj5mj0b.cloudfront.net https421 0.076 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error
2016-02-02 07:57:45 LHR5 1158241 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /images/posts/cover/404.jpg 200 https://d3g47gpj5mj0b.cloudfront.net/foo Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Miss oUdDIjmA1ON1GjWmFEKlrbNzZx60w6EHxzmaUdWEwGMbq8V536O4WA== d3g47gpj5mj0b.cloudfront.net https 419 0.440 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss
and created the table with this SQL:
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
`Date` DATE,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
Method STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
os STRING,
Browser STRING,
BrowserVersion STRING
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
) LOCATION 's3://test/athena-csv/'
But no data is returned:
I can see it returns 4 rows, but the first 2 should be excluded because they start with #, so it seems the regex is not being applied correctly.
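For reference, the guide's regex can be sanity-checked locally with Python's `re` module (close to, but not identical to, the Java regex engine that RegexSerDe uses). Run against the first sample line above, it returns no match at all, which would explain the unparsed fields; one visible mismatch is that the pasted user agent is double-encoded (`%2520`) while the regex expects a literal `%20`:

```python
import re

# The input.regex from the CREATE TABLE statement, as a Python raw string.
pattern = re.compile(
    r'^(?!#)([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+'
    r'([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+'
    r'[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$'
)

# First data line from the test file (user agent abbreviated, still %2520-encoded).
line = ('2016-02-02 07:57:45 LHR5 5001 86.177.253.38 GET '
        'd3g47gpj5mj0b.cloudfront.net /foo 404 - '
        'Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)')

print(pattern.match(line))  # None: ".*%20" never finds a literal %20 in the line
```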
Am I doing something wrong, or is the regex wrong (which seems unlikely, since it comes from the documentation and looks fine to me)?
Athena is case insensitive and treats every column as lowercase. Try defining your Athena table and querying with lowercase field names instead.
The demo didn't work for me either. After playing around with it for a while, I ended up with the following:
CREATE EXTERNAL TABLE IF NOT EXISTS DBNAME.TABLENAME (
`date` date,
`time` string,
`location` string,
`bytes` int,
`request_ip` string,
`method` string,
`host` string,
`uri` string,
`status` int,
`referer` string,
`useragent` string,
`uri_query` string,
`cookie` string,
`edge_type` string,
`edget_requiest_id` string,
`host_header` string,
`cs_protocol` string,
`cs_bytes` int,
`time_taken` string,
`x_forwarded_for` string,
`ssl_protocol` string,
`ssl_cipher` string,
`result_type` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '^(?!#.*)(?!#.*)([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)$'
) LOCATION 's3://bucket/logs/';
Replace bucket/logs and DBNAME.TABLENAME with your own values. For some reason it still inserts empty rows for the lines starting with #, but I get the rest of the data.
I think the next step is to try building one for the user agent or the cookies.
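The empty rows for the # lines are expected behavior with RegexSerDe: when a line does not match input.regex, the SerDe returns a row of NULLs rather than skipping the line. A minimal sketch (pattern simplified to three fields for illustration) shows the lookahead working as intended:

```python
import re

# Simplified three-field version of the input.regex above.
pattern = re.compile(r'^(?!#.*)(\S+)\s+(\S+)\s+(\S+)$')

for line in ['#Version: 1.0', '2016-02-02 07:57:45 LHR5']:
    m = pattern.match(line)
    # RegexSerDe emits NULL for every column when the regex does not match,
    # which is why the comment lines still show up as empty rows.
    print(m.groups() if m else 'no match -> row of NULLs')
```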
This solved my problem, improving on @CoderDans' answer:
The trick is to use \t instead of \s in the regex to separate the values.
CREATE EXTERNAL TABLE IF NOT EXISTS mytablename (
`date` date,
`time` string,
`location` string,
`bytes` int,
`request_ip` string,
`method` string,
`host` string,
`uri` string,
`status` int,
`referer` string,
`useragent` string,
`uri_query` string,
`cookie` string,
`edge_type` string,
`edget_request_id` string,
`host_header` string,
`cs_protocol` string,
`cs_bytes` int,
`time_taken` int,
`x_forwarded_for` string,
`ssl_protocol` string,
`ssl_cipher` string,
`result_type` string,
`protocol_version` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '^(?!#.*)(?!#.*)([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)$'
) LOCATION 's3://mybucket/myprefix/';
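Why \t rather than \s matters can be seen with a quick local sketch: CloudFront writes exactly one tab between fields, and splitting on generic whitespace breaks apart any field value that itself contains a space. The sample line below is hypothetical, purely to illustrate the difference:

```python
import re

# Hypothetical tab-delimited line whose last field contains a space.
line = '2016-02-02\t07:57:45\tLHR5\tMozilla/5.0 (Macintosh)'

print(re.split(r'\t+', line))  # 4 fields: the last field stays intact
print(re.split(r'\s+', line))  # 5 fields: the last field is split in two
```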
Here is what I ended up with:
CREATE EXTERNAL TABLE logs (
`date` date,
`time` string,
`location` string,
`bytes` int,
`request_ip` string,
`method` string,
`host` string,
`uri` string,
`status` int,
`referer` string,
`useragent` string,
`uri_query` string,
`cookie` string,
`edge_type` string,
`edget_requiest_id` string,
`host_header` string,
`cs_protocol` string,
`cs_bytes` int,
`time_taken` string,
`x_forwarded_for` string,
`ssl_protocol` string,
`ssl_cipher` string,
`result_type` string,
`protocol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex' = '^(?!#.*)(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s*(\\S*)'
) LOCATION 's3://logs'
Note that the double backslashes are intentional.
The CloudFront log format changed at some point to add the protocol field; this handles both the old and the new files.
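The trailing `\s*(\S*)` is what makes the last column optional, so one regex covers files written both before and after CloudFront added the protocol field. A reduced sketch with two required fields plus the optional tail:

```python
import re

# Two required fields plus the optional trailing field, as in the full regex.
pattern = re.compile(r'^(?!#.*)(\S+)\s+(\S+)\s*(\S*)')

old = pattern.match('2016-02-02 07:57:45')            # pre-protocol format
new = pattern.match('2016-02-02 07:57:45 HTTP/1.1')   # newer format

print(old.groups())  # ('2016-02-02', '07:57:45', '')
print(new.groups())  # ('2016-02-02', '07:57:45', 'HTTP/1.1')
```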
Actually, all the answers here share one small mistake: the 4th field must be BIGINT, not INT. Otherwise requests for your files larger than 2 GB will not be parsed correctly. After a long discussion with AWS business support, the correct format appears to be:
CREATE EXTERNAL TABLE your_table_name (
`Date` DATE,
Time STRING,
Location STRING,
SCBytes BIGINT,
RequestIP STRING,
Method STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
UserAgent STRING,
UriQS STRING,
Cookie STRING,
ResultType STRING,
RequestId STRING,
HostHeader STRING,
Protocol STRING,
CSBytes BIGINT,
TimeTaken FLOAT,
XForwardFor STRING,
SSLProtocol STRING,
SSLCipher STRING,
ResponseResultType STRING,
CSProtocolVersion STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://path_to_your_data_directory'
TBLPROPERTIES ('skip.header.line.count' = '2')
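The >2 GB limit follows directly from the integer widths: Hive's INT is a signed 32-bit type, so any byte count above 2,147,483,647 overflows it, while BIGINT (signed 64-bit) has ample headroom:

```python
# Hive/Athena INT is signed 32-bit, BIGINT is signed 64-bit.
INT_MAX = 2**31 - 1       # 2147483647, roughly 2 GB
BIGINT_MAX = 2**63 - 1

three_gb = 3 * 1024**3    # sc-bytes for a 3 GB response

print(three_gb > INT_MAX)      # True: overflows INT
print(three_gb <= BIGINT_MAX)  # True: fits comfortably in BIGINT
```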
This one worked for me. I started here, but I had to add the "protocol" column.
CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
`date` DATE,
time STRING,
location STRING,
bytes BIGINT,
request_ip STRING,
method STRING,
host STRING,
uri STRING,
status INT,
referrer STRING,
user_agent STRING,
query_string STRING,
cookie STRING,
result_type STRING,
request_id STRING,
host_header STRING,
request_protocol STRING,
request_bytes BIGINT,
time_taken FLOAT,
xforwarded_for STRING,
ssl_protocol STRING,
ssl_cipher STRING,
response_result_type STRING,
http_version STRING,
fle_status STRING,
fle_encrypted_fields INT,
protocol STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://bucketname/prefix/'
TBLPROPERTIES ( 'skip.header.line.count'='2' )
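The 'skip.header.line.count'='2' property is what finally gets rid of the two comment lines: instead of asking a regex to reject them, Athena simply drops the first two lines of every file, which is exactly where CloudFront writes #Version and #Fields. The effect, sketched in plain Python:

```python
# Every CloudFront log file starts with exactly two comment lines.
sample_file = [
    '#Version: 1.0',
    '#Fields: date time x-edge-location ...',
    '2016-02-02\t07:57:45\tLHR5',
]

skip_header_line_count = 2
data = sample_file[skip_header_line_count:]  # what Athena actually scans

print(data)  # only the data line remains
```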
I'd like to share some strange behavior around this that I ran into on a new AWS account.
The examples from the AWS reference docs and the regex approach didn't work well for me, so I tried the Athena console wizard again (Create -> Create a table from data source -> S3 bucket data), and surprisingly it worked.
I checked the DDL of the wizard-generated table (SHOW CREATE TABLE table_name;), and the only obvious difference was TBLPROPERTIES.has_encrypted_data set to false.
So at first I assumed the default for has_encrypted_data might be true. Since the bucket I was querying has no encryption configured, I tested several times with this property set to both true and false, and noticed that both worked fine.
Still confused, I removed the property entirely (again using the example from the AWS reference docs), and guess what? It worked.
So I assume the problem may be related to a bug on the AWS side.