Snowflake 查询引擎策略在几个带查询条件
Snowflake query engine strategy on several with query conditions
我正在执行从 pyspark 查询到 snowflake 查询的迁移作业,想知道下面 A、B 选项之间哪个选项更好。
为了避免不必要的查询,如果没有那么明显的性能差异,我想选择 B 选项。
在 B 选项中,snowflake 查询引擎是否自动优化并在内部表现得像 A 选项?
一个选项
With A1 AS (select * from a1 where date='2021-10-20'),
A2 AS (select * from a2 where date='2021-10-20'),
A3 AS (select * from a3 where date='2021-10-20'),
A4 AS (select * from a4 where date='2021-10-20'),
A5 AS (select * from a5 where date='2021-10-20')
SELECT *
FROM final_merged_table
和B选项
With A1 AS (select * from a1),
A2 AS (select * from a2),
A3 AS (select * from a3),
A4 AS (select * from a4),
A5 AS (select * from a5)
SELECT *
FROM final_merged_table
WHERE date = '2021-10-20'
我们可以测试一下。首先,让我们构建一个包含一周日期和几百万行的 table:
create or replace table one_week2
as
select '2020-04-01'::date + (7*seq8()/100000000)::int day, random() data, random() data2, random() data3
from table(generator(rowcount => 100000000))
现在我们可以编写两个查询来检查这个 table:
选项 1:
With A1 AS (select * from one_week2 where day='2020-04-05'),
A2 AS (select * from one_week2 where day='2020-04-05'),
A3 AS (select * from one_week2 where day='2020-04-05'),
A4 AS (select * from one_week2 where day='2020-04-05'),
A5 AS (select * from one_week2 where day='2020-04-05'),
final_merged_table as (
select * from a1
union all select * from a2
union all select * from a3
union all select * from a4
union all select * from a5)
SELECT count(*)
FROM final_merged_table
选项 2:
With A1 AS (select * from one_week2),
A2 AS (select * from one_week2),
A3 AS (select * from one_week2),
A4 AS (select * from one_week2),
A5 AS (select * from one_week2),
final_merged_table as (
select * from a1
union all select * from a2
union all select * from a3
union all select * from a4
union all select * from a5)
SELECT count(*)
FROM final_merged_table
where day='2020-04-05'
;
当我们 运行 这些查询时,两者的配置文件看起来相同 - 因为过滤器已被下推:
选项 1 配置文件
选项 2 配置文件
总结
您可以信任 Snowflake 优化器。
信任很重要,但也要验证:有时优化器可能会被复杂的 CTE 弄糊涂。有时 Snowflake 工程师会优化优化器,今天不起作用的东西明天会更好。
我正在执行从 pyspark 查询到 snowflake 查询的迁移作业,想知道下面 A、B 选项之间哪个选项更好。
为了避免不必要的查询,如果没有那么明显的性能差异,我想选择 B 选项。
在 B 选项中,snowflake 查询引擎是否自动优化并在内部表现得像 A 选项?
一个选项
With A1 AS (select * from a1 where date='2021-10-20'),
A2 AS (select * from a2 where date='2021-10-20'),
A3 AS (select * from a3 where date='2021-10-20'),
A4 AS (select * from a4 where date='2021-10-20'),
A5 AS (select * from a5 where date='2021-10-20')
SELECT *
FROM final_merged_table
和B选项
With A1 AS (select * from a1),
A2 AS (select * from a2),
A3 AS (select * from a3),
A4 AS (select * from a4),
A5 AS (select * from a5)
SELECT *
FROM final_merged_table
WHERE date = '2021-10-20'
我们可以测试一下。首先,让我们构建一个包含一周日期和几百万行的 table:
create or replace table one_week2
as
select '2020-04-01'::date + (7*seq8()/100000000)::int day, random() data, random() data2, random() data3
from table(generator(rowcount => 100000000))
现在我们可以编写两个查询来检查这个 table:
选项 1:
With A1 AS (select * from one_week2 where day='2020-04-05'),
A2 AS (select * from one_week2 where day='2020-04-05'),
A3 AS (select * from one_week2 where day='2020-04-05'),
A4 AS (select * from one_week2 where day='2020-04-05'),
A5 AS (select * from one_week2 where day='2020-04-05'),
final_merged_table as (
select * from a1
union all select * from a2
union all select * from a3
union all select * from a4
union all select * from a5)
SELECT count(*)
FROM final_merged_table
选项 2:
With A1 AS (select * from one_week2),
A2 AS (select * from one_week2),
A3 AS (select * from one_week2),
A4 AS (select * from one_week2),
A5 AS (select * from one_week2),
final_merged_table as (
select * from a1
union all select * from a2
union all select * from a3
union all select * from a4
union all select * from a5)
SELECT count(*)
FROM final_merged_table
where day='2020-04-05'
;
当我们 运行 这些查询时,两者的配置文件看起来相同 - 因为过滤器已被下推:
选项 1 配置文件
选项 2 配置文件
总结
您可以信任 Snowflake 优化器。
信任很重要,但也要验证:有时优化器可能会被复杂的 CTE 弄糊涂。有时 Snowflake 工程师会优化优化器,今天不起作用的东西明天会更好。