How to use "create table as" with dates that are aggregated through group by in SQL?
Given data like the following, where the dates are strings in YYYYMMDD format:
item | vietnamese | cost | unique_id | sales_date |
---|---|---|---|---|
fruits | trai cay | 10 | abc123 | 20211001 |
fruits | trai cay | 8 | foo99 | 20211001 |
fruits | trai cay | 9 | foo99 | 20211001 |
vege | rau | 3 | rr1239 | 20211001 |
vege | rau | 3 | rr1239 | 20211001 |
fruits | trai cay | 12 | abc123 | 20211002 |
fruits | trai cay | 14 | abc123 | 20211002 |
fruits | trai cay | 8 | abc123 | 20211002 |
fruits | trai cay | 5 | foo99 | 20211002 |
vege | rau | 8 | rr1239 | 20211002 |
vege | rau | 1 | rr1239 | 20211002 |
vege | rau | 12 | ud9213 | 20211002 |
vege | rau | 19 | r11759 | 20211002 |
fruits | trai cay | 6 | foo99 | 20211003 |
fruits | trai cay | 2 | abc123 | 20211003 |
fruits | trai cay | 12 | abc123 | 20211003 |
vege | rau | 1 | ud97863 | 20211003 |
vege | rau | 9 | r112359 | 20211003 |
fruits | trai cay | 6 | foo99 | 20211004 |
fruits | trai cay | 2 | abc123 | 20211004 |
fruits | trai cay | 12 | abc123 | 20211004 |
vege | rau | 9 | r112359 | 20211004 |
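The goal is to
- select at most N rows per sales_date within a specific time range
- aggregate the data using group by on the item column,

for example, at most 3 rows per day between "20211002" and "20211004":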
SELECT *
FROM
(SELECT item,
max(vietnamese) as vietnamese,
sum(cost) as total_cost,
array_agg(cost) as costs,
array_agg(unique_id) as unique_ids,
row_number() over (partition by max(sales_date) order by rand()) as row
FROM mytable
where sales_date between '20211002' and '20211004'
GROUP BY item)
where row <= 3
limit 9
Note: the vietnamese column maps one-to-one to each item, hence the max(vietnamese).

The result of the above should look something like:
item | vietnamese | costs | unique_ids |
---|---|---|---|
fruits | trai cay | [8] | [abc123] |
vege | rau | [8, 1] | [rr1239, rr1239] |
fruits | trai cay | [2, 12] | [abc123, abc123] |
vege | rau | [1] | [ud97863] |
fruits | trai cay | [6, 2, 12] | [foo99, abc123, abc123] |
The desired output, saved in parquet format:
item | vietnamese | costs | unique_ids | sales_date |
---|---|---|---|---|
fruits | trai cay | [8] | [abc123] | 20211002 |
vege | rau | [8, 1] | [rr1239, rr1239] | 20211002 |
fruits | trai cay | [2, 12] | [abc123, abc123] | 20211003 |
vege | rau | [1] | [ud97863] | 20211003 |
fruits | trai cay | [6, 2, 12] | [foo99, abc123, abc123] | 20211004 |
The aim is to save it to s3://somes3path/ with the following directory structure:
s3://somes3path/
    item=fruits/
        sales_date=20211002
        sales_date=20211003
    item=vege/
        sales_date=20211002
        sales_date=20211003
        sales_date=20211004
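How can I produce the expected output in the directory structure listed above?

I tried the following, but it did not save the data in the directory structure I expected: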
CREATE TABLE somedb.mytable
WITH ( format = 'PARQUET', external_location = 's3://somes3path/',
partitioned_by = ARRAY['item'],
bucketed_by = ARRAY['sales_date'], bucket_count = 30) AS
SELECT *
FROM
(SELECT item,
max(vietnamese) as vietnamese,
sum(cost) as total_cost,
array_agg(cost) as costs,
array_agg(unique_id) as unique_ids,
first(sales_date) as sales_date,
row_number() over (partition by max(sales_date) order by rand()) as row
FROM mytable
where sales_date between '20211002' and '20211004'
GROUP BY item)
where row <= 3
limit 9
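Your output is partitioned only by item. If you change it to partition by both item and sales_date, you will get the desired directory structure. Remove the bucketing, since it has no effect once you partition on sales_date: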
WITH (
format = 'PARQUET',
external_location = 's3://somes3path/',
partitioned_by = ARRAY['item', 'sales_date']
)
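
For reference, here is a rough sketch of how the whole statement might look with that WITH clause, assuming this is an Athena-style CTAS (the external_location and partitioned_by properties suggest so). The target name somedb.mytable_sampled, the rn alias, and the inner sampling subquery are illustrative assumptions, not part of the original query. The sketch samples up to 3 rows per sales_date first and only then groups by item and sales_date, so that sales_date survives as a real column to partition on. The partition columns are placed last in the SELECT list, in the same order as in partitioned_by, which Athena requires for CTAS.

```sql
-- Sketch only: somedb.mytable_sampled is a hypothetical target table name.
CREATE TABLE somedb.mytable_sampled
WITH (
    format = 'PARQUET',
    external_location = 's3://somes3path/',
    partitioned_by = ARRAY['item', 'sales_date']
) AS
SELECT
    max(vietnamese)      AS vietnamese,
    sum(cost)            AS total_cost,
    array_agg(cost)      AS costs,
    array_agg(unique_id) AS unique_ids,
    item,                -- partition columns go last,
    sales_date           -- in the same order as partitioned_by
FROM (
    -- Assumed sampling step: number the rows of each day in random order.
    SELECT
        item, vietnamese, cost, unique_id, sales_date,
        row_number() OVER (PARTITION BY sales_date ORDER BY rand()) AS rn
    FROM mytable
    WHERE sales_date BETWEEN '20211002' AND '20211004'
) AS sampled
WHERE rn <= 3            -- keep at most 3 rows per sales_date
GROUP BY item, sales_date
```

Each distinct (item, sales_date) pair in the result then lands under its own item=.../sales_date=.../ prefix. Note that the external_location has to be empty before the statement runs, and that the partition columns are encoded in the directory names rather than stored inside the parquet files.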