How to use "create table as" with dates that are aggregated through GROUP BY in SQL?

Given data like the following, where the date is a string in YYYYMMDD format:

item    vietnamese  cost  unique_id  sales_date
fruits  trai cay    10    abc123     20211001
fruits  trai cay    8     foo99      20211001
fruits  trai cay    9     foo99      20211001
vege    rau         3     rr1239     20211001
vege    rau         3     rr1239     20211001
fruits  trai cay    12    abc123     20211002
fruits  trai cay    14    abc123     20211002
fruits  trai cay    8     abc123     20211002
fruits  trai cay    5     foo99      20211002
vege    rau         8     rr1239     20211002
vege    rau         1     rr1239     20211002
vege    rau         12    ud9213     20211002
vege    rau         19    r11759     20211002
fruits  trai cay    6     foo99      20211003
fruits  trai cay    2     abc123     20211003
fruits  trai cay    12    abc123     20211003
vege    rau         1     ud97863    20211003
vege    rau         9     r112359    20211003
fruits  trai cay    6     foo99      20211004
fruits  trai cay    2     abc123     20211004
fruits  trai cay    12    abc123     20211004
vege    rau         9     r112359    20211004

The goal is, for example, to get at most 3 rows per day between '20211002' and '20211004':

SELECT *
FROM 
    (SELECT item, 
            max(vietnamese) as vietnamese,
            sum(cost) as total_cost,
            array_agg(cost) as costs,
            array_agg(unique_id) as unique_ids,
            row_number() over (partition by max(sales_date) order by rand()) as row
     FROM mytable
     where sales_date between '20211002' and '20211004'
  GROUP BY item)
where row <= 3
limit 9

Note: the vietnamese column maps one-to-one to item, hence max(vietnamese).

The result of the above should look something like:

item    vietnamese  costs       unique_ids
fruits  trai cay    [8]         [abc123]
vege    rau         [8, 1]      [rr1239, rr1239]
fruits  trai cay    [2, 12]     [abc123, abc123]
vege    rau         [1]         [ud97863]
fruits  trai cay    [6, 2, 12]  [foo99, abc123, abc123]

The desired output, saved in Parquet format:

item    vietnamese  costs       unique_ids               sales_date
fruits  trai cay    [8]         [abc123]                 20211002
vege    rau         [8, 1]      [rr1239, rr1239]         20211002
fruits  trai cay    [2, 12]     [abc123, abc123]         20211003
vege    rau         [1]         [ud97863]                20211003
fruits  trai cay    [6, 2, 12]  [foo99, abc123, abc123]  20211004

The aim is to save to s3://somes3path/ with a directory structure like this:

s3://somes3path/
     item=fruits/
        sales_date=20211002
        sales_date=20211003
     item=vege/
        sales_date=20211002
        sales_date=20211003
        sales_date=20211004

How can I get the expected output in the directory structure listed above?


I tried this, but it doesn't save the data in the directory structure I expected:

CREATE TABLE somedb.mytable
WITH ( format = 'PARQUET', external_location = 's3://somes3path/', 
       partitioned_by = ARRAY['item'], 
       bucketed_by = ARRAY['sales_date'], bucket_count = 30) AS 
SELECT *
FROM 
    (SELECT item, 
            max(vietnamese) as vietnamese,
            sum(cost) as total_cost,
            array_agg(cost) as costs,
            array_agg(unique_id) as unique_ids,
            first(sales_date) as sales_date,
            row_number() over (partition by max(sales_date) order by rand()) as row
     FROM mytable
     where sales_date between '20211002' and '20211004'
  GROUP BY item)
where row <= 3
limit 9

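Your output is partitioned only by item. If you change it to partition by item and sales_date, you will get the desired directory structure. Remove the bucketing, since it has no effect once you partition on sales_date:
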
WITH (
  format = 'PARQUET',
  external_location = 's3://somes3path/', 
  partitioned_by = ARRAY['item', 'sales_date']
)
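
For context, here is a rough sketch of how that WITH clause might sit inside a complete statement. It assumes Amazon Athena CTAS syntax (which the external_location property suggests), where the partition columns have to come last in the SELECT list, in the same order as in partitioned_by, and where the target S3 location has to be empty. The table name somedb.mytable_by_day is hypothetical, chosen only so the target does not clash with the source table. The sketch groups by both item and sales_date and leaves out the random row_number() sampling from the question, so treat it as a starting point rather than a drop-in replacement.

```sql
-- Sketch only: somedb.mytable_by_day is a hypothetical target table name.
CREATE TABLE somedb.mytable_by_day
WITH (
  format = 'PARQUET',
  external_location = 's3://somes3path/',
  partitioned_by = ARRAY['item', 'sales_date']
) AS
SELECT max(vietnamese)      AS vietnamese,
       sum(cost)            AS total_cost,
       array_agg(cost)      AS costs,
       array_agg(unique_id) AS unique_ids,
       -- Partition columns go last, in the same order as partitioned_by.
       item,
       sales_date
FROM mytable
WHERE sales_date BETWEEN '20211002' AND '20211004'
-- Grouping by sales_date as well yields one row per item per day,
-- so no first(sales_date) or max(sales_date) is needed.
GROUP BY item, sales_date
```

With both columns in partitioned_by, each item/sales_date combination is written under its own item=.../sales_date=.../ prefix below external_location.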