Why does Apache Calcite estimate 100 rows for all tables a query contains?

Question

I recently tried to execute a query in Apache Calcite using three CSV files as tables:

TTLA_ONE contains 59 rows
TTLR_ONE contains 61390 rows
EMPTY_T contains 0 rows

This is the query I executed:

EXPLAIN PLAN FOR SELECT COUNT(*) as NUM 
FROM TTLA_ONE A 
INNER JOIN TTLR_ONE B1 ON A.X = B1.X
INNER JOIN TTLR_ONE B2 ON B2.X = B1.X
INNER JOIN EMPTY_T C1 ON C1.X = B2.Y
INNER JOIN EMPTY_T C2 ON C2.X = C2.X

The query result is always zero because we are joining with an empty table. The resulting plan is:

EnumerableAggregate(group=[{}], NUM=[COUNT()])
  EnumerableJoin(condition=[=(, )], joinType=[inner])
    EnumerableJoin(condition=[=([=11=], )], joinType=[inner])
      EnumerableInterpreter
        BindableTableScan(table=[[STYPES, TTLA_ONE]])
      EnumerableCalc(expr#0..1=[{inputs}], X=[$t0])
        EnumerableInterpreter
          BindableTableScan(table=[[STYPES, TTLR_ONE]])
    EnumerableJoin(condition=[=(, )], joinType=[inner])
      EnumerableJoin(condition=[true], joinType=[inner])
        EnumerableCalc(expr#0=[{inputs}], expr#1=[IS NOT NULL($t0)], X=[$t0], $condition=[$t1])
          EnumerableInterpreter
            BindableTableScan(table=[[STYPES, EMPTY_T]])
        EnumerableInterpreter
          BindableTableScan(table=[[STYPES, EMPTY_T]])
      EnumerableInterpreter
        BindableTableScan(table=[[STYPES, TTLR_ONE]])

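Note that the empty tables are still scanned in the final plan.

I have added example test code here.

I dug into the code and turned on debug logging, and I saw that every table's row count is estimated as 100, which is not true.

Below are the plan estimates written to the log in debug mode: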

EnumerableJoin(condition=[=(, )], joinType=[inner]): rowcount = 3.0375E7, cumulative cost = {3.075002214917643E7 rows, 950.0 cpu, 0.0 io}, id = 26284
  EnumerableJoin(condition=[=([=12=], )], joinType=[inner]): rowcount = 1500.0, cumulative cost = {2260.517018598809 rows, 400.0 cpu, 0.0 io}, id = 26267
    EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26260
      BindableTableScan(table=[[STYPES, TTLA_ONE]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7789
    EnumerableCalc(expr#0..1=[{inputs}], X=[$t0]): rowcount = 100.0, cumulative cost = {150.0 rows, 350.0 cpu, 0.0 io}, id = 26290
      EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26263
        BindableTableScan(table=[[STYPES, TTLR_ONE]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7791
  EnumerableJoin(condition=[=(, )], joinType=[inner]): rowcount = 135000.0, cumulative cost = {226790.8015771949 rows, 550.0 cpu, 0.0 io}, id = 26282
    EnumerableJoin(condition=[true], joinType=[inner]): rowcount = 9000.0, cumulative cost = {9695.982870329724 rows, 500.0 cpu, 0.0 io}, id = 26277
      EnumerableCalc(expr#0=[{inputs}], expr#1=[IS NOT NULL($t0)], X=[$t0], $condition=[$t1]): rowcount = 90.0, cumulative cost = {140.0 rows, 450.0 cpu, 0.0 io}, id = 26288
        EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26270
          BindableTableScan(table=[[STYPES, EMPTY_T]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7787
      EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26275
        BindableTableScan(table=[[STYPES, EMPTY_T]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7787
    EnumerableInterpreter: rowcount = 100.0, cumulative cost = {50.0 rows, 50.0 cpu, 0.0 io}, id = 26280
      BindableTableScan(table=[[STYPES, TTLR_ONE]]): rowcount = 100.0, cumulative cost = {1.0 rows, 1.01 cpu, 0.0 io}, id = 7791

We can clearly see that the estimate for every table is always 100 (rowcount = 100.0).
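The join estimates in the log can be reproduced by hand from that default of 100 rows per scan. The sketch below is only an illustration, not Calcite code: the class and method names are hypothetical, and the selectivity factors 0.15 (equality condition) and 0.9 (IS NOT NULL) are assumptions inferred from the logged numbers, although they also appear to match Calcite's default guesses. It multiplies the input row counts by the assumed selectivity at each step of the logged plan.

```java
// Hypothetical illustration, not part of Calcite: reproduces the logged
// row-count estimates from the default table size and assumed selectivities.
final class JoinRowCountSketch {
  // Assumption: selectivity applied to an equality join condition.
  static final double EQUALS_SELECTIVITY = 0.15;
  // Assumption: selectivity applied to an IS NOT NULL filter.
  static final double IS_NOT_NULL_SELECTIVITY = 0.9;

  // Estimated output of a join: product of both inputs times the selectivity.
  static double join(double leftRows, double rightRows, double selectivity) {
    return leftRows * rightRows * selectivity;
  }

  // Estimated output of a filter: input rows times the selectivity.
  static double filter(double inputRows, double selectivity) {
    return inputRows * selectivity;
  }

  // Walks the logged plan bottom-up for the given per-table row counts.
  static double topLevelEstimate(double rowsA, double rowsB, double rowsC) {
    double aJoinB1 = join(rowsA, rowsB, EQUALS_SELECTIVITY);       // log: 1500.0
    double c1NotNull = filter(rowsC, IS_NOT_NULL_SELECTIVITY);     // log: 90.0
    double c1CrossC2 = join(c1NotNull, rowsC, 1.0);                // log: 9000.0
    double cJoinB2 = join(c1CrossC2, rowsB, EQUALS_SELECTIVITY);   // log: 135000.0
    return join(aJoinB1, cJoinB2, EQUALS_SELECTIVITY);             // log: 3.0375E7
  }

  public static void main(String[] args) {
    // Every scan at the default of 100 rows, as in the debug log.
    System.out.println(topLevelEstimate(100.0, 100.0, 100.0));
    // The row counts stated in the question: 59, 61390 and 0.
    System.out.println(topLevelEstimate(59.0, 61390.0, 0.0));
  }
}
```

With the actual row counts from the question, the same arithmetic gives an estimate of zero rows as soon as EMPTY_T is involved, which is the information the planner is missing.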

The query executes correctly, but the plan is not optimized. Does anyone know why the table statistics are not being evaluated correctly?

Answer 1

The answer here seems to be the same as for the question already linked in the comments.

Flink does not (yet) reorder joins

In the current version (1.7.1, Jan 2019), ... Calcite uses its default value which is 100.

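So the planner is not looking for tables with zero rows. In particular, I suspect from those answers that even if you reorder the tables in the FROM clause, it still will not notice.

In general, SQL optimization is driven by the availability of indexes and the cardinality of the tables.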

The only way to inject cardinality estimates for tables is via an ExternalCatalog.

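Are you doing that?

If you are loading these tables as CSV files, have you declared keys, indexes, and whatever else the catalog needs?

It sounds like Calcite is not a mature product. If you are looking for a test bed to examine SQL optimisations or query plans, use a different product.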

Answer 2

The problem is that the class CsvTable needs to override the getStatistic method, along these lines:

  private Statistic statistic;
  // TODO: assign the statistic

  @Override
  public Statistic getStatistic() {
    return statistic;
  }

The statistics could be passed in through the constructor, or you could inject some object that generates them.

Currently it just returns Statistics.UNKNOWN, from the superclass implementation in AbstractTable. Of course, without statistics, the estimated cost of the plan is incorrect.
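As one possible "object that generates them", here is a minimal sketch of a row counter for a CSV file, using only the JDK. The class name CsvRowCounter is hypothetical, and the sketch assumes the first non-blank line is a header and that no field contains an embedded line break. The resulting count is the kind of value the statistic field above could be built from, for example with Statistics.of(rowCount, keys); that wiring is left out so the sketch does not depend on Calcite.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

// Hypothetical helper, not part of Calcite's CSV adapter.
final class CsvRowCounter {
  // Counts data rows: non-blank lines minus one header line.
  // Assumption: no quoted field spans multiple lines.
  static long countDataRows(Path csvFile) throws IOException {
    try (Stream<String> lines = Files.lines(csvFile)) {
      long nonBlank = lines.filter(line -> !line.trim().isEmpty()).count();
      return Math.max(0L, nonBlank - 1L);
    }
  }

  public static void main(String[] args) throws IOException {
    // A header-only file, like the EMPTY_T table in the question.
    Path empty = Files.createTempFile("empty_t", ".csv");
    Files.write(empty, Arrays.asList("X:string,Y:string"));
    // This count could then back getStatistic() in the table class.
    System.out.println(countDataRows(empty));
    Files.delete(empty);
  }
}
```

Counting once when the table is constructed and caching the result avoids re-reading the file each time the planner asks for the statistic, at the cost of going stale if the file changes.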

Tags: csv, relational-algebra, sql-optimization, apache-calcite