How to query a partitioned AWS Athena table
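Problem summary
When I attempt to run a SELECT query with a WHERE clause against the partitioned table, Athena produces an error.
My logs table has four partition keys:
- year (string)
- month (string)
- day (string)
- hour (string)
Running a SELECT query on the partitioned table returns the following error message.
Error message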
GENERIC_INTERNAL_ERROR: No value present
This query ran against the "default" database, unless qualified by the query.
SELECT queries I tried
SELECT *
FROM logs
WHERE year='2020'
AND month='10'
AND day ='05';
and
SELECT *
FROM "default"."logs"
WHERE year='2020'
AND month='10'
AND day ='05';
Because the error message says No value present, I checked the table's partitions.
SHOW PARTITIONS logs;
Result
year=2020/month=10/day=05/hour=17
year=2020/month=10/day=05/hour=11
year=2020/month=10/day=05/hour=19
year=2020/month=10/day=05/hour=04
year=2020/month=10/day=05/hour=18
year=2020/month=10/day=05/hour=15
year=2020/month=10/day=05/hour=14
year=2020/month=10/day=05/hour=16
year=2020/month=10/day=05/hour=13
year=2020/month=10/day=05/hour=21
year=2020/month=10/day=05/hour=05
year=2020/month=10/day=05/hour=08
year=2020/month=10/day=05/hour=20
year=2020/month=10/day=05/hour=12
year=2020/month=10/day=05/hour=03
year=2020/month=10/day=05/hour=01
year=2020/month=10/day=05/hour=10
year=2020/month=10/day=05/hour=02
year=2020/month=10/day=05/hour=09
year=2020/month=10/day=05/hour=22
year=2020/month=10/day=05/hour=23
year=2020/month=10/day=05/hour=06
year=2020/month=10/day=05/hour=07
year=2020/month=10/day=05/hour=00
year=2020/month=10/day=04/hour=00
Thank you very much for your help.
More information
The CREATE TABLE command I used to create the table:
CREATE EXTERNAL TABLE `logs`(
`date` date,
`time` string,
`location` string,
`bytes` bigint,
`request_ip` string,
`method` string,
`host` string,
`uri` string,
`status` int,
`referrer` string,
`user_agent` string,
`query_string` string,
`cookie` string,
`result_type` string,
`request_id` string,
`host_header` string,
`request_protocol` string,
`request_bytes` bigint,
`time_taken` float,
`xforwarded_for` string,
`ssl_protocol` string,
`ssl_cipher` string,
`response_result_type` string,
`http_version` string,
`fle_status` string,
`fle_encrypted_fields` int)
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
WITH SERDEPROPERTIES (
'input.regex'='^(?!#)([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)\\s+([^ \\t]+)$')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mybucket/path'
TBLPROPERTIES (
'projection.date.format'='yyyy/MM/dd',
'projection.date.interval'='1',
'projection.date.interval.unit'='DAYS',
'projection.date.range'='2019/11/27, NOW-1DAYS',
'projection.date.type'='date',
'projection.day.type'='string',
'projection.enabled'='true',
'projection.hour.type'='string',
'projection.month.type'='string',
'projection.year.type'='string',
'skip.header.line.count'='2',
'storage.location.template'='s3://mybucket/path/distributionID/${year}/${month}/${day}/${hour}/',
'transient_lastDdlTime'='1575005094')
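Your table uses partition projection, but your configuration doesn't match your partitioning. Partition projection is a fairly new feature and the documentation still leaves some things to be desired, so I completely understand that it's confusing. I think I understand what you are trying to do.
Partition projection configuration must match the table's partition keys exactly. In your case the table has four partition keys (year, month, day, hour), while the partition projection configuration mentions five (date, year, month, day, hour). In addition, four of them are given the type string, and there is no string partition projection type; the supported types are enum, integer, date, and injected.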
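The mismatch described above can be checked mechanically. The following is a minimal Python sketch, not anything Athena provides: check_projection is a hypothetical helper, and the dictionary only contains the projection properties from the question's TBLPROPERTIES.

```python
# Hypothetical helper (not part of Athena or any AWS SDK): compares a table's
# partition keys with the columns named in its 'projection.<column>.type'
# table properties.
VALID_TYPES = {"enum", "integer", "date", "injected"}


def check_projection(partition_keys, tblproperties):
    """Return a list of problems; an empty list means the two agree."""
    projected = {}
    for key, value in tblproperties.items():
        parts = key.split(".")
        # Only 'projection.<column>.type' entries name a projected column.
        if len(parts) == 3 and parts[0] == "projection" and parts[2] == "type":
            projected[parts[1]] = value

    problems = []
    for col in partition_keys:
        if col not in projected:
            problems.append(f"partition key '{col}' has no projection type")
    for col, typ in projected.items():
        if col not in partition_keys:
            problems.append(f"projection configured for '{col}', which is not a partition key")
        if typ not in VALID_TYPES:
            problems.append(f"'{typ}' is not a projection type (column '{col}')")
    return problems


# Partition keys and projection properties as posted in the question.
question_keys = ["year", "month", "day", "hour"]
question_props = {
    "projection.enabled": "true",
    "projection.date.type": "date",
    "projection.date.format": "yyyy/MM/dd",
    "projection.year.type": "string",
    "projection.month.type": "string",
    "projection.day.type": "string",
    "projection.hour.type": "string",
}

problems = check_projection(question_keys, question_props)
for problem in problems:
    print(problem)
```

For the question's table this reports the extra date column plus the four string types.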
You can fix the problem by making two changes. First, change the partition keys like this:
PARTITIONED BY (
`date` string,
`hour` string
)
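This removes the year, month, and day partition keys in favour of a single date key. Separate date components are not necessary just because they are separate "directories" in S3, and a single date key makes queries much easier to write.
Then change the table properties to: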
TBLPROPERTIES (
'projection.date.format' = 'yyyy/MM/dd',
'projection.date.interval' = '1',
'projection.date.interval.unit' = 'DAYS',
'projection.date.range' = '2019/11/27, NOW-1DAYS',
'projection.date.type' = 'date',
'projection.hour.type' = 'integer',
'projection.hour.range' = '0,23',
'projection.hour.digits' = '2',
'projection.enabled' = 'true',
'storage.location.template'='s3://mybucket/path/distributionID/${date}/${hour}/',
'skip.header.line.count' = '2'
)
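This tells Athena that the date partition key is of type date and is formatted as YYYY/MM/DD (which corresponds to the format in the S3 URIs; this is important). It also tells Athena that the hour partition key is an integer in the range 0 to 23, formatted with two digits (i.e. zero-padded). Finally, it specifies how these partition keys map to the partition locations on S3. When the date in a query is '2020/10/06', that string is inserted verbatim into the location template.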
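To make the mapping concrete, here is a rough Python simulation of how a location template is filled in for a date range and a range of hours. It only mimics the substitution described above and is not Athena's implementation; projected_locations and its parameters are made up for illustration, and %Y/%m/%d is the strftime counterpart of yyyy/MM/dd.

```python
from datetime import date, timedelta


def projected_locations(template, start, end, date_format="%Y/%m/%d",
                        hour_range=(0, 23), hour_digits=2):
    """Expand a location template for every date and hour in the given ranges."""
    locations = []
    day = start
    while day <= end:
        for hour in range(hour_range[0], hour_range[1] + 1):
            location = (template
                        # The formatted date is inserted verbatim, slashes included.
                        .replace("${date}", day.strftime(date_format))
                        # 'digits' = 2 corresponds to zero-padding the hour.
                        .replace("${hour}", str(hour).zfill(hour_digits)))
            locations.append(location)
        day += timedelta(days=1)
    return locations


TEMPLATE = "s3://mybucket/path/distributionID/${date}/${hour}/"

# One day, hours 9 and 10 only, to keep the output short.
locs = projected_locations(TEMPLATE, date(2020, 10, 6), date(2020, 10, 6),
                           hour_range=(9, 10))
for loc in locs:
    print(loc)
```

Without the two-digit padding the first location would end in /9/ instead of /09/ and would not match the objects in S3.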
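With these changes you should be able to run queries like the following (date is a reserved word and must be double-quoted when used as a column name):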
SELECT *
FROM logs
WHERE "date" = '2020/10/06';
SELECT *
FROM logs
WHERE "date" BETWEEN '2020/10/01' AND '2020/10/06'
-- hour is declared as a string key, so compare against zero-padded strings
AND hour BETWEEN '09' AND '21';
Note that the date format must be exactly the same as in the partition projection configuration, i.e. YYYY/MM/DD.
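If the queries are generated from code, one way to keep the literal in step with the projection format is to build it from date objects. This is only a sketch: date_predicate and PARTITION_DATE_FORMAT are illustrative names, and the format string is assumed to mirror 'projection.date.format' = 'yyyy/MM/dd'.

```python
from datetime import date

# Assumed to mirror 'projection.date.format' = 'yyyy/MM/dd' in the table properties.
PARTITION_DATE_FORMAT = "%Y/%m/%d"


def date_predicate(start, end):
    """Build the WHERE fragment for the quoted "date" partition key."""
    lo = "'" + start.strftime(PARTITION_DATE_FORMAT) + "'"
    hi = "'" + end.strftime(PARTITION_DATE_FORMAT) + "'"
    if lo == hi:
        return '"date" = ' + lo
    return '"date" BETWEEN ' + lo + " AND " + hi


print(date_predicate(date(2020, 10, 6), date(2020, 10, 6)))
print(date_predicate(date(2020, 10, 1), date(2020, 10, 6)))
```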
Theo's answer was especially helpful for the number of digits in the hour and day values, because my S3 partitions are laid out as YYYY/MM/DD:
'projection.hour.digits' = '2'
This was the key to making it work here. Thanks @theo
In my case, I'm using Parquet files:
CREATE EXTERNAL TABLE `table_name`(
`id` string,
-- more columns..
)
PARTITIONED BY (
`year` string,
`month` string,
`day` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://fullbucketname/full_prefix_dir'
TBLPROPERTIES (
'classification'='parquet',
'compressionType'='none',
'has_encrypted_data'='false',
'projection.day.range'='1,31',
'projection.day.type'='integer',
'projection.day.digits' = '2',
'projection.enabled'='true',
'projection.month.range'='1,12',
'projection.month.type'='integer',
'projection.month.digits' = '2',
'projection.year.range'='2020,2051',
'projection.year.type'='integer',
'storage.location.template'='s3://fullbucketname/full_prefix_dir/year=${year}/month=${month}/day=${day}',
'typeOfData'='file')