Best way to get distinct count from a query joining two tables

Question

I have 2 tables, table A and table B.

Table A (has thousands of rows)

id
uuid
name
type
created_by
org_id

Table B (a max of hundred rows)

org_id
org_name

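I am trying to find the best join query to obtain a count with a WHERE clause. I need the count of distinct created_by values in table A where org_name in table B contains 'myorg'. I currently have the query below (it produces the expected results) and wonder whether it can be optimized further.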

select count(distinct a.created_by)
from a left join
     b
     on a.org_id = b.org_id 
where b.org_name like '%myorg%';

Answer 1

You don't need a left join:

select count(distinct a.created_by)
from a join
     b
     on a.org_id = b.org_id
where b.org_name like '%myorg%'

For this query, you want an index on b.org_id, which I assume you have.
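To see why the left join is not needed, here is a minimal sketch with made-up sample data. The temp tables, column types, and values are assumptions for illustration only, not the asker's real schema (note that temp tables named a and b hide same-named regular tables for the rest of the session). Rows of a without a match in b get a NULL org_name from the left join, and NULL LIKE '%myorg%' is not true, so the WHERE clause discards them anyway:

```sql
-- Hypothetical sample data; types and values are assumed for illustration.
CREATE TEMP TABLE a (id int, created_by int, org_id int);
CREATE TEMP TABLE b (org_id int, org_name text);

INSERT INTO b VALUES (1, 'myorg-east'), (2, 'other');
INSERT INTO a VALUES
   (1, 10, 1),
   (2, 10, 1),
   (3, 11, 1),
   (4, 12, 2),
   (5, 13, 99);  -- org_id 99 has no match in b

-- LEFT JOIN: row 5 survives the join with b.org_name = NULL,
-- but the WHERE clause filters it out again.
SELECT count(DISTINCT a.created_by) AS left_join_count
FROM   a
LEFT   JOIN b ON a.org_id = b.org_id
WHERE  b.org_name LIKE '%myorg%';

-- Plain JOIN: row 5 never enters the result in the first place.
SELECT count(DISTINCT a.created_by) AS inner_join_count
FROM   a
JOIN   b ON a.org_id = b.org_id
WHERE  b.org_name LIKE '%myorg%';
```

Both queries return 2 for this data (created_by 10 and 11), so the plain join states the intent more directly without changing the result.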

Answer 2

I would use exists for this:

select count(distinct a.created_by)
from a
where exists (select 1 from b where b.org_id = a.org_id and b.org_name like '%myorg%')

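An index on b(org_id) would help. But in terms of performance, the key points are:

- Searching with like and wildcards on both sides is bad for performance (it cannot take advantage of an index); it would be better to search for an exact match, or at least to avoid the wildcard on the left side of the string.
- count(distinct ...) is more expensive than a regular count(); if you don't really need distinct, then don't use it.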
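As a rough sketch of the first point, assuming PostgreSQL and a hypothetical index name: a b-tree index can serve a left-anchored pattern, and the text_pattern_ops operator class makes that possible when the database does not use the C locale. Note that the left-anchored pattern matches a different set of rows than the original one, so this only applies if the requirement allows it.

```sql
-- Hypothetical index; the name is an assumption for illustration.
-- text_pattern_ops allows left-anchored LIKE to use a b-tree index
-- in databases that do not use the C locale.
CREATE INDEX b_org_name_pattern_idx ON b (org_name text_pattern_ops);

-- Left-anchored pattern: the index above is usable.
SELECT org_id
FROM   b
WHERE  org_name LIKE 'myorg%';

-- Wildcard on both sides: a b-tree index cannot be used for the pattern.
SELECT org_id
FROM   b
WHERE  org_name LIKE '%myorg%';
```

With at most around a hundred rows in b, the planner may well prefer a sequential scan for either form, so the difference matters mostly if b grows.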

Answer 3

Your query looks good already. Use a plain [INNER] JOIN instead of LEFT [OUTER] JOIN, like Gordon suggested. But that won't change much.

You mention that table B has only ...

a max of hundred rows

while table A has ...

thousands of rows

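If there are many rows per created_by (which I'd expect), then there is potential for an emulated index skip scan.
(The need to emulate it might go away in one of the coming Postgres versions.)

The essential ingredient is this multicolumn index: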

CREATE INDEX ON a (org_id, created_by);

It can replace a simple index on just (org_id) and works for your simple query as well. See:

Is a composite index also good for queries on the first field?

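There are two complications in your case:

- DISTINCT
- 0-n org_id values resulting from org_name like '%myorg%'

So the optimization is harder to achieve. But it is still possible with some fancy SQL: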

SELECT count(DISTINCT created_by)  -- does not count NULL (as desired)
FROM   b
CROSS  JOIN LATERAL (
   WITH RECURSIVE t AS (
      (  -- parentheses required
      SELECT created_by
      FROM   a
      WHERE  org_id = b.org_id
      ORDER  BY created_by
      LIMIT 1
      )
      UNION ALL
      SELECT (SELECT created_by
              FROM   a
              WHERE  org_id = b.org_id
              AND    created_by > t.created_by
              ORDER  BY created_by
              LIMIT  1)
      FROM   t
      WHERE  t.created_by IS NOT NULL  -- stop recursion
      )
   TABLE t
   ) a
WHERE  b.org_name LIKE '%myorg%';

db<>fiddle here (Postgres 12, but it works in Postgres 9.6, too.)

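That's a recursive CTE in a LATERAL subquery, using a correlated subquery.

It uses the multicolumn index from above to retrieve only a single row per (org_id, created_by). It does so with index-only scans if the table is vacuumed enough.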
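The core of the technique may be easier to follow for a single organization. The sketch below is a stripped-down variant that assumes the index on a (org_id, created_by) exists and uses a hypothetical literal org_id = 1 in place of the LATERAL reference to b:

```sql
-- Emulated index skip scan for one org_id (the value 1 is an assumption).
WITH RECURSIVE t AS (
   (  -- parentheses required
   -- Start with the smallest created_by for this org_id.
   SELECT created_by
   FROM   a
   WHERE  org_id = 1
   ORDER  BY created_by
   LIMIT  1
   )
   UNION ALL
   -- Each step jumps to the next greater created_by via the index.
   SELECT (SELECT created_by
           FROM   a
           WHERE  org_id = 1
           AND    created_by > t.created_by
           ORDER  BY created_by
           LIMIT  1)
   FROM   t
   WHERE  t.created_by IS NOT NULL  -- stop recursion
   )
SELECT count(created_by)  -- ignores the trailing NULL row
FROM   t;
```

Each iteration is one cheap index lookup, so the work grows with the number of distinct created_by values rather than with the number of rows in a. The full query above repeats this per matching row of b, which is why DISTINCT is still needed there: the same created_by can appear under several organizations.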

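The main objective of the sophisticated SQL is to completely avoid a sequential scan (or even a bitmap index scan) on the big table and to read only very few fast index tuples.

Due to the added overhead, it can be slower for an unfavorable data distribution (many org_id values and/or only few rows per created_by). But it is much faster under favorable conditions and scales excellently, even for millions of rows. You'll have to test to find the sweet spot.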
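One possible way to run that test, sketched under the assumption that the tables are named a and b as in the question, is to compare the actual plans and timings of both variants with EXPLAIN:

```sql
-- Refresh statistics and the visibility map so index-only scans are possible.
VACUUM ANALYZE a;

-- Baseline: the plain join query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(DISTINCT a.created_by)
FROM   a
JOIN   b ON a.org_id = b.org_id
WHERE  b.org_name LIKE '%myorg%';

-- Then prefix the recursive LATERAL query with the same
-- EXPLAIN (ANALYZE, BUFFERS) clause and compare execution time
-- and the number of buffers read.
```

Running each statement a few times helps separate cold-cache effects from the steady-state cost.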

Tags: sql, postgresql, join, postgresql-performance, postgres-9.6