How to use "create table as" with dates that are aggregated through group by in SQL?

Question

Given data like the following, where the date is a string in YYYYMMDD format:

item	vietnamese	cost	unique_id	sales_date
fruits	trai cay	10	abc123	20211001
fruits	trai cay	8	foo99	20211001
fruits	trai cay	9	foo99	20211001
vege	rau	3	rr1239	20211001
vege	rau	3	rr1239	20211001
fruits	trai cay	12	abc123	20211002
fruits	trai cay	14	abc123	20211002
fruits	trai cay	8	abc123	20211002
fruits	trai cay	5	foo99	20211002
vege	rau	8	rr1239	20211002
vege	rau	1	rr1239	20211002
vege	rau	12	ud9213	20211002
vege	rau	19	r11759	20211002
fruits	trai cay	6	foo99	20211003
fruits	trai cay	2	abc123	20211003
fruits	trai cay	12	abc123	20211003
vege	rau	1	ud97863	20211003
vege	rau	9	r112359	20211003
fruits	trai cay	6	foo99	20211004
fruits	trai cay	2	abc123	20211004
fruits	trai cay	12	abc123	20211004
vege	rau	9	r112359	20211004

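The goal is to

select at most N rows per sales_date within a specific time range
aggregate the data using group by on the item column

For example, to get at most 3 rows per day between '20211002' and '20211004':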

SELECT *
FROM 
    (SELECT item, 
            max(vietnamese) as vietnamese,
            sum(cost) as total_cost,
            array_agg(cost) as costs,
            array_agg(unique_id) as unique_ids,
            row_number() over (partition by max(sales_date) order by rand()) as row
     FROM mytable
     where sales_date between '20211002' and '20211004'
  GROUP BY item)
where row <= 3
limit 9

Note: the vietnamese column maps one-to-one to each item, hence the max(vietnamese).

The result of the query above should look something like:

item	vietnamese	costs	unique_ids
fruits	trai cay	[8]	[abc123]
vege	rau	[8, 1]	[rr1239, rr1239]
fruits	trai cay	[2, 12]	[abc123, abc123]
vege	rau	[1]	[ud97863]
fruits	trai cay	[6, 2, 12]	[foo99, abc123, abc123]

The desired output, saved in Parquet format, is:

item	vietnamese	costs	unique_ids	sales_date
fruits	trai cay	[8]	[abc123]	20211002
vege	rau	[8, 1]	[rr1239, rr1239]	20211002
fruits	trai cay	[2, 12]	[abc123, abc123]	20211003
vege	rau	[1]	[ud97863]	20211003
fruits	trai cay	[6, 2, 12]	[foo99, abc123, abc123]	20211004

The goal is to save it to s3://somes3path/ with the following directory structure:

s3://somes3path/
     item=fruits/
        sales_date=20211002
        sales_date=20211003
     item=vege/
        sales_date=20211002
        sales_date=20211003
        sales_date=20211004

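How can I produce the expected output in the directory structure listed above?

I tried the following, but it does not save the data in the directory structure I expected: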

CREATE TABLE somedb.mytable
WITH ( format = 'PARQUET', external_location = 's3://somes3path/', 
       partitioned_by = ARRAY['item'], 
       bucketed_by = ARRAY['sales_date'], bucket_count = 30) AS 
SELECT *
FROM 
    (SELECT item, 
            max(vietnamese) as vietnamese,
            sum(cost) as total_cost,
            array_agg(cost) as costs,
            array_agg(unique_id) as unique_ids,
            first(sales_date) as sales_date,
            row_number() over (partition by max(sales_date) order by rand()) as row
     FROM mytable
     where sales_date between '20211002' and '20211004'
  GROUP BY item)
where row <= 3
limit 9

Answer 1

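Your output is partitioned only by item. If you change it to partition by both item and sales_date, you will get the directory structure you want. Remove the bucketing, since it will not have any effect when you partition on sales_date: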

WITH (
  format = 'PARQUET',
  external_location = 's3://somes3path/', 
  partitioned_by = ARRAY['item', 'sales_date']
)
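
Putting that WITH clause into a full statement could look roughly like the sketch below. This is not a tested solution, and it makes several assumptions beyond what the question states: the target name somedb.mytable_sampled and the alias rn are hypothetical, the per-day sampling is moved into an inner query so that "at most 3 rows per day" applies to the raw rows, and the outer query groups by both item and sales_date to match the desired output. In an Athena CTAS the partition columns have to be listed last in the SELECT list, so item and sales_date are moved to the end, in the same order as in partitioned_by.

```sql
-- Sketch under stated assumptions; somedb.mytable_sampled and rn are hypothetical names.
CREATE TABLE somedb.mytable_sampled
WITH (
  format = 'PARQUET',
  external_location = 's3://somes3path/',
  partitioned_by = ARRAY['item', 'sales_date']
) AS
SELECT
  max(vietnamese)      AS vietnamese,   -- one-to-one with item, so max() just picks it
  sum(cost)            AS total_cost,
  array_agg(cost)      AS costs,
  array_agg(unique_id) AS unique_ids,
  item,                                 -- partition columns go last,
  sales_date                            -- in the same order as partitioned_by
FROM (
  -- Number the raw rows randomly within each day so that at most 3 per day survive.
  SELECT item, vietnamese, cost, unique_id, sales_date,
         row_number() OVER (PARTITION BY sales_date ORDER BY rand()) AS rn
  FROM mytable
  WHERE sales_date BETWEEN '20211002' AND '20211004'
) AS sampled
WHERE rn <= 3
GROUP BY item, sales_date
```

With three days in the range and at most 3 rows kept per day, the inner query returns at most 9 rows, so the limit 9 from the original query is not needed here. Once the table exists, SHOW PARTITIONS somedb.mytable_sampled should list one item=.../sales_date=... entry per directory that was written.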

sql

create-table

database-partitioning

amazon-athena