How does Athena know how to partition your data?

I've been reading this AWS blog article, and it made sense to me until it got to the part about partitioning. The query it uses to create the table looks like this:

CREATE EXTERNAL TABLE IF NOT EXISTS elb_logs_raw_native_part (
  request_timestamp string, 
  elb_name string, 
  request_ip string, 
  request_port int, 
  backend_ip string, 
  backend_port int, 
  request_processing_time double, 
  backend_processing_time double, 
  client_response_time double, 
  elb_response_code string, 
  backend_response_code string, 
  received_bytes bigint, 
  sent_bytes bigint, 
  request_verb string, 
  url string, 
  protocol string, 
  user_agent string, 
  ssl_cipher string, 
  ssl_protocol string ) 
PARTITIONED BY(year string, month string, day string) -- Where does Athena get this data?
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
         'serialization.format' = '1','input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:\-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\"([^ ]*) ([^ ]*) (- |[^ ]*)\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )
LOCATION 's3://athena-examples/elb/raw/';

What confuses me is the statement that it will "partition by year (among other things)", but nowhere else in the SQL is it specified which part of the data is the year. Moreover, none of those column names has a date type. So how does Athena know how to partition this data when you never told it which part of the data is the year, month, or day?

In the context of the blog post, the year comes from the file name, but no step tells Athena that information. The article says the logs follow a predefined format: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html#access-log-entry-format but there is no year column that I can see.

Edit: The article isn't very explicit about this, but I think it may be saying that each PARTITIONED BY column is a subdirectory in the S3 bucket? In other words, the first element of the PARTITIONED BY clause (year, in this case) is the first subdirectory of the bucket, and so on.

That only partly makes sense to me, because the same article says, "You can partition your data across multiple dimensions―e.g., month, week, day, hour, or customer ID―or all of them together." I don't understand how you could do all of them together if they come from subdirectories, unless you had a massive amount of duplication in your bucket.
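Unless, of course, the dimensions are nested rather than parallel, so that each combination of values lives in exactly one place and nothing is duplicated. Something like this (a hypothetical layout I made up, not one from the article):

```text
s3://my-bucket/logs/2015/01/01/00/file1.log
s3://my-bucket/logs/2015/01/01/01/file2.log
s3://my-bucket/logs/2015/01/02/00/file3.log
```

Here the path components would correspond, in order, to year, month, day, and hour.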

This article, linked, explains it better than the original post.

To create a table with partitions, you must define it during the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data. There are two scenarios discussed in the following sections:

  1. Data is already partitioned, stored on Amazon S3, and you need to access the data on Athena.

  2. Data is not partitioned.

My question was about number 1. For that case, it says:

Partitions are stored in separate folders in Amazon S3. For example, here is the partial listing for sample ad impressions:

aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/

PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/
PRE dt=2009-04-12-14-00/
PRE dt=2009-04-12-14-05/
PRE dt=2009-04-12-14-10/
PRE dt=2009-04-12-14-15/
PRE dt=2009-04-12-14-20/
PRE dt=2009-04-12-15-00/
PRE dt=2009-04-12-15-05/ 

Here, logs are stored with the column name (dt) set equal to date, hour, and minute increments. When you give a DDL with the location of the parent folder, the schema, and the name of the partitioned column, Athena can query data in those subfolders.
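Putting that together, my understanding is that the DDL for that sample would look roughly like this (a sketch only; the column list is illustrative, not the actual impressions schema):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS impressions (
  request_begin_time string,   -- illustrative columns, not the real schema
  ad_id string,
  impression_id string )
PARTITIONED BY (dt string)     -- matches the dt= prefix in the folder names
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/';

-- Because the folders follow the Hive key=value convention, Athena can
-- discover all of the partitions with a single statement:
MSCK REPAIR TABLE impressions;
```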

Where the original article fails (IMO) is that it never shows the aws s3 ls output. If it had, I wouldn't have been confused. In the article's case, imagine there are S3 key prefixes named year, month, and day; PARTITIONED BY refers to those keys.

If your files are not organized this neatly, you can read them in and partition them with a different SQL statement (case 2 above):

ALTER TABLE elb_logs_raw_native_part ADD PARTITION (year='2015', month='01', day='01') LOCATION 's3://athena-examples/elb/plaintext/2015/01/01/';
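Once a partition is added, the partition columns behave like ordinary columns in queries, and Athena only scans the matching S3 prefix. For example:

```sql
-- Only reads objects under .../2015/01/01/ thanks to partition pruning
SELECT elb_name, count(*) AS requests
FROM elb_logs_raw_native_part
WHERE year = '2015' AND month = '01' AND day = '01'
GROUP BY elb_name;
```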

Nice find, Daniel! This reminds me of an old discussion I had with AWS Support on this topic. I'd like to post an excerpt of it here; maybe someone finds it useful:

I just read the Athena documentation about partitioning data in S3 [1].
I wonder about the sample which is given in "Scenario 1: Data already partitioned and stored on S3 in hive format", "Storing Partitioned Data":
the command "aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/" returns e.g. "PRE dt=2009-04-12-13-00/".

Is my assumption correct, that in order to be able to partition the data in Athena automatically, I have to prefix my S3 folder names with "partition_key=actual_folder_name"?
Otherwise I do not understand why the example ls-command above returns S3 keys which start with the "dt=" prefix. I think it should be better documented at this point in the Athena documentation what "data on S3 in hive format" means. [...]

References:
[1] https://docs.aws.amazon.com/athena/latest/ug/partitions.html

Answer

I understand that you have some questions about partitioning in Athena as per AWS documentation https://docs.aws.amazon.com/athena/latest/ug/partitions.html

To answer your question: Is my assumption correct, that in order to be able to partition the data in Athena automatically, I have to prefix my S3 folder names with "partition_key=actual_folder_name"?

Yes, you are correct that in order for Athena to detect the partitions automatically, the S3 prefixes should be in 'key=value' pairs. Otherwise you have to add all those partitions manually using the 'ALTER TABLE .. ADD PARTITION' command, as mentioned in the documentation itself.

I understand that you found our documentation to be less verbose with respect to Hive-style partitioning. [...] However, the reason behind the shorter description of Hive partitioning is that Hive, being an open source tool, has open source documentation available explaining Hive-style partitioning in detail, e.g. link [1].

If you find changing the S3 naming or adding partitions manually a tedious task due to its manual nature, I suggest using an AWS Glue crawler [2] to create an Athena table on your S3 data. Glue will detect the partitions even with non-Hive-style partitioning and will assign keys to the partitions like 'partition_0', 'partition_1', etc. [...]

References:
[1] http://hadooptutorial.info/partitioning-in-hive/
[2] https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
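To summarize the three routes from that exchange in DDL terms (my own sketch, with hypothetical bucket paths and table names):

```sql
-- Hive-style layout (e.g. s3://my-bucket/impressions/dt=2009-04-12-13-00/):
-- Athena discovers every partition with one statement
MSCK REPAIR TABLE impressions;

-- Plain layout (e.g. s3://my-bucket/impressions/2009-04-12-13-00/):
-- each prefix has to be registered by hand
ALTER TABLE impressions ADD PARTITION (dt = '2009-04-12-13-00')
  LOCATION 's3://my-bucket/impressions/2009-04-12-13-00/';

-- Glue-crawler route: the crawler assigns generic partition keys, which are
-- then queried like any other partition column (table name is hypothetical)
SELECT * FROM my_crawled_logs
WHERE partition_0 = '2015' AND partition_1 = '01';
```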