Optimizing a join in Vertica
I have a query like this:
SELECT a.column, b.column
FROM table_a a
INNER JOIN table_b b
ON a.id = b.id
WHERE a.anotherid = 'some condition'
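It should be very fast, because with the predicate a.anotherid = 'some condition' the query plan should filter out a large amount of data on table_b.
However, according to the Vertica documentation: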
The WHERE clause is evaluated after the join is performed. It filters
records returned by the FROM clause, eliminating any records that do
not satisfy the WHERE clause condition.
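In other words, the query performs the join first and filters afterwards, which is slow. The query plan shows this as well.
So, is there a way to push the filter down so it runs before the join?
Or is there another way to optimize the query?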
The EXPLAIN output shows NO STATISTICS. The statistics need to be updated.
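A minimal sketch of refreshing the statistics, assuming the tables are named table_a and table_b as in the question and are reachable through the current search path (qualify them with a schema name if they are not):

```sql
-- Collect optimizer statistics for both join inputs.
-- ANALYZE_STATISTICS takes the table name as a string.
SELECT ANALYZE_STATISTICS('table_a');
SELECT ANALYZE_STATISTICS('table_b');

-- Re-run EXPLAIN afterwards and check whether NO STATISTICS
-- still appears on either STORAGE ACCESS step.
EXPLAIN
SELECT a.id, b.id
FROM table_a a
INNER JOIN table_b b ON a.id = b.id
WHERE a.anotherid = 'some condition';
```

With current statistics the optimizer has row counts and value distributions to work with, so its cost estimates and its choice of inner and outer input are more likely to be sensible.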
- In this case, Vertica will optimize the predicate using SIP:
Sideways Information Passing (SIP) has been effective in improving join performance by filtering data as early as possible in the plan. It can be thought of as an advanced variation of predicate push down since the join is being used to do filtering [27]. For example, consider a HashJoin that joins two tables using simple equality predicates. The HashJoin will first create a hash table from the inner input before it starts reading data from the outer input to do the join. Special SIP filters are built during optimizer planning and placed in the Scan operator. At run time, the Scan has access to the Join’s hash table and the SIP filters are used to evaluate whether the outer key values exist in the hash table. Rows that do not pass these filters are not output by the Scan thus increasing performance since we are not unnecessarily bringing the data through the plan only to be filtered away later by the join.
For example:
SELECT a.online_page_key
FROM online_sales.online_sales_fact a
JOIN online_sales.online_page_dimension b
ON b.online_page_key = a.online_page_key
WHERE b.page_type = 'quarterly';
will generate the same plan as:
SELECT a.online_page_key
FROM online_sales.online_sales_fact a
JOIN (SELECT *
FROM online_sales.online_page_dimension
WHERE page_type = 'quarterly') b
ON b.online_page_key = a.online_page_key;
which looks like:
Access Path:
+-JOIN HASH [Cost: 14K, Rows: 988K] (PATH ID: 1)
| Join Cond: (online_page_dimension.online_page_key = a.online_page_key)
| +-- Outer -> STORAGE ACCESS for a [Cost: 12K, Rows: 5M] (PATH ID: 2)
| | Projection: online_sales.online_sales_fact_super
| | Materialize: a.online_page_key
| | Runtime Filter: (SIP1(HashJoin): a.online_page_key)
| +-- Inner -> STORAGE ACCESS for online_page_dimension [Cost: 36, Rows: 198] (PATH ID: 3)
| | Projection: online_sales.online_page_dimension_super
| | Materialize: online_page_dimension.online_page_key
| | Filter: (online_page_dimension.page_type = 'quarterly')
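Applied to the query in the question, the same idea can be sketched as follows. The column names col_a and col_b are hypothetical placeholders for the columns being selected, and whether the two forms actually produce the same plan on your data is an assumption to verify with EXPLAIN, not a guarantee:

```sql
-- Form 1: predicate in the WHERE clause, as in the question.
EXPLAIN
SELECT a.col_a, b.col_b
FROM table_a a
INNER JOIN table_b b ON a.id = b.id
WHERE a.anotherid = 'some condition';

-- Form 2: predicate applied explicitly in a subquery before the join.
EXPLAIN
SELECT a.col_a, b.col_b
FROM (SELECT *
      FROM table_a
      WHERE anotherid = 'some condition') a
INNER JOIN table_b b ON a.id = b.id;
```

If both plans show the Filter line on the scan of table_a, below the join, the predicate is already being applied before the join and rewriting the query by hand gains nothing.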
- Most of the time, a hash join will be sufficient. If you want to optimize for a merge join instead, see my post on optimizing for merge join.