SQL query to find a row with a specific number of associations

Question

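Using Postgres, I have a schema with conversations and conversationUsers. Each conversation has many conversationUsers. I want to find the conversation that has exactly the specified set of conversationUsers. In other words, given an array of userIds (say, [1, 4, 6]), I want to find the conversation that contains only those users, and no more.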

So far I've tried this:

SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."userId" IN (1, 4)
GROUP BY c."conversationId"
HAVING COUNT(c."userId") = 2;

Unfortunately, this also returns conversations that include these 2 users among others. (For example, a conversation is returned even if it also includes "userId" 5.)

Answer 1

You can modify your query like this and it should work:

SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."conversationId" IN (
    SELECT DISTINCT c1."conversationId"
    FROM "conversationUsers" c1
    WHERE c1."userId" IN (1, 4)
    )
GROUP BY c."conversationId"
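-- exactly 2 distinct users, and every one of them is in the list;
-- without bool_and() a conversation of users 1 and 5 would also match
HAVING COUNT(DISTINCT c."userId") = 2
   AND bool_and(c."userId" IN (1, 4));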

Answer 2

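This may be easier to follow. You want the conversation ID, so group by it. Add a HAVING clause requiring that the number of rows matching the given user IDs equals the number of distinct user IDs in the group. This will work, but it takes longer to process because there is no pre-qualifier.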

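select
      cu."conversationId"
   from
      "conversationUsers" cu
   group by
      cu."conversationId"
   having
      -- every user in the conversation is one of the given users ...
      sum( case when cu."userId" IN (1, 4) then 1 else 0 end ) = count( distinct cu."userId" )
      -- ... and all 2 of the given users are present
      and count( distinct cu."userId" ) = 2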

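To narrow the list further, pre-query the conversations that at least one of the given users takes part in. If that user is not in a conversation to begin with, there is no reason to consider it.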

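select
      cu."conversationId"
   from
      ( select cu2."conversationId"
           from "conversationUsers" cu2
           where cu2."userId" = 4 ) preQual
      JOIN "conversationUsers" cu
         ON preQual."conversationId" = cu."conversationId"
   group by
      cu."conversationId"
   having
      -- every user in the conversation is one of the given users ...
      sum( case when cu."userId" IN (1, 4) then 1 else 0 end ) = count( distinct cu."userId" )
      -- ... and all 2 of the given users are present
      and count( distinct cu."userId" ) = 2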
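The same HAVING condition can also be written as a single array comparison. This is only a minimal sketch, assuming that "userId" is an integer column, that each (conversation, user) pair is stored once, and that the input list is written in ascending order:

```sql
SELECT cu."conversationId"
FROM   "conversationUsers" cu
GROUP  BY cu."conversationId"
-- the sorted list of members must equal the sorted input list exactly
HAVING array_agg(cu."userId" ORDER BY cu."userId") = '{1,4}'::int[];
```

Like the first query in this answer, it aggregates every conversation in the table, so it has the same cost profile.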

Answer 3

This is a case of relational-division, with the added special requirement that the same conversation must not have any additional users.

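Assuming the two columns form the PK of table "conversationUsers", which enforces uniqueness of combinations, NOT NULL, and also implicitly provides the index essential for performance. The order of the columns in the multicolumn PK matters! Otherwise you have to do more.
About the order of index columns: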

Is a composite index also good for queries on the first field?
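For experimenting with the queries in this answer, the table could be set up as sketched here. The column types, the PK column order (with "userId" first, chosen because the queries filter on "userId") and the sample rows are assumptions for illustration, not taken from the question:

```sql
-- Hypothetical definition; types and PK column order are assumptions.
CREATE TABLE "conversationUsers" (
   "userId"         int NOT NULL
 , "conversationId" int NOT NULL
 , PRIMARY KEY ("userId", "conversationId")  -- index leads with "userId"
);

-- Conversation 1 has exactly users 1, 4, 6;
-- conversation 2 has an additional user 5;
-- conversation 3 is missing user 6.
INSERT INTO "conversationUsers" ("conversationId", "userId") VALUES
   (1, 1), (1, 4), (1, 6)
 , (2, 1), (2, 4), (2, 5), (2, 6)
 , (3, 1), (3, 4);
```

With this data, a correct query for the input {1,4,6} should return only conversation 1.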

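For the basic query, there is the "brute force" approach: count the matching users for all conversations of all given users, then keep the conversations that match all given users. That is OK for small tables and/or short input arrays and/or few conversations per user, but it doesn't scale well: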

SELECT "conversationId"
FROM   "conversationUsers" c
WHERE  "userId" = ANY ('{1,4,6}'::int[])
GROUP  BY 1
HAVING count(*) = array_length('{1,4,6}'::int[], 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = c."conversationId"
   AND    "userId" <> ALL('{1,4,6}'::int[])
   );

The NOT EXISTS anti-semi-join eliminates conversations with additional users. More:

How do I (or can I) SELECT DISTINCT on multiple columns?

Alternative techniques:

Select rows which are not present in other table

There are various other, (much) faster relational-division query techniques. But the fastest ones are not well suited for a dynamic number of user IDs.

How to filter SQL results in a has-many-through relation
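To illustrate why such techniques are tied to a fixed number of user IDs, here is a sketch of one static variant, with one self-join per user. It is written for exactly three users and is not necessarily the fastest technique from the linked answer:

```sql
SELECT c1."conversationId"
FROM   "conversationUsers" c1
JOIN   "conversationUsers" c2 USING ("conversationId")
JOIN   "conversationUsers" c3 USING ("conversationId")
WHERE  c1."userId" = 1
AND    c2."userId" = 4
AND    c3."userId" = 6
AND    NOT EXISTS (  -- no additional users in the same conversation
   SELECT FROM "conversationUsers" x
   WHERE  x."conversationId" = c1."conversationId"
   AND    x."userId" NOT IN (1, 4, 6)
   );
```

Every additional user ID needs another JOIN and another WHERE condition, so the query text itself changes with the length of the input.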

For a fast query that can also deal with a dynamic number of user IDs, consider a recursive CTE:

WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = ('{1,4,6}'::int[])[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte                r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = ('{1,4,6}'::int[])[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length(('{1,4,6}'::int[]), 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL('{1,4,6}'::int[])
   );

For ease of use, wrap this in a function or prepared statement. Like:

PREPARE conversations(int[]) AS
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = ($1)[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte                r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = ($1)[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length($1, 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL($1)
   );

Call:

EXECUTE conversations('{1,4,6}');

db<>fiddle here (also demonstrating a function)
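A function wrapper for the same recursive query could look roughly like this. The function name f_conversations, the parameter name _user_ids and the integer types are assumptions made for this sketch; it is not necessarily the function shown in the fiddle:

```sql
-- Hypothetical wrapper; name and int types are assumptions.
CREATE OR REPLACE FUNCTION f_conversations(_user_ids int[])
  RETURNS SETOF int
  LANGUAGE sql STABLE AS
$func$
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = _user_ids[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte                r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = _user_ids[r.idx + 1]
   )
SELECT r."conversationId"
FROM   rcte r
WHERE  r.idx = array_length(_user_ids, 1)
AND    NOT EXISTS (  -- no additional users in the same conversation
   SELECT FROM "conversationUsers" x
   WHERE  x."conversationId" = r."conversationId"
   AND    x."userId" <> ALL (_user_ids)
   );
$func$;

SELECT * FROM f_conversations('{1,4,6}');
```

Unlike a prepared statement, the function persists across sessions.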

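There is still room for improvement: to get top performance, you have to put the users with the fewest conversations first in the input array, to eliminate as many rows as possible early. To go further still, you can generate a non-dynamic, non-recursive query dynamically (using one of the fast techniques from the first link) and execute that in turn. You could even wrap it in a single plpgsql function with dynamic SQL ...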
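A rough sketch of that idea, building one self-join per user ID and executing the generated statement. The function name f_conversations_dyn, the integer types and the choice of the self-join technique are assumptions; the sketch needs Postgres 9.4 or later for cardinality() and FILTER:

```sql
-- Hypothetical function; name, types and join technique are assumptions.
CREATE OR REPLACE FUNCTION f_conversations_dyn(_user_ids int[])
  RETURNS SETOF int
  LANGUAGE plpgsql STABLE AS
$func$
DECLARE
   _n   int := cardinality(_user_ids);
   _sql text;
BEGIN
   IF _n IS NULL OR _n = 0 THEN
      RETURN;  -- nothing to search for
   END IF;

   -- Build: c1 JOIN c2 ... JOIN cN, one table alias per user ID.
   SELECT 'SELECT c1."conversationId" FROM "conversationUsers" c1'
       || COALESCE(string_agg(format(' JOIN "conversationUsers" c%s USING ("conversationId")', i), '' ORDER BY i)
                   FILTER (WHERE i > 1), '')
       || ' WHERE '
       || string_agg(format('c%1$s."userId" = ($1)[%1$s]', i), ' AND ' ORDER BY i)
       || ' AND NOT EXISTS (SELECT FROM "conversationUsers" x'
       || ' WHERE x."conversationId" = c1."conversationId" AND x."userId" <> ALL ($1))'
   INTO  _sql
   FROM  generate_series(1, _n) i;

   -- The array is passed as parameter $1, not concatenated into the SQL text.
   RETURN QUERY EXECUTE _sql USING _user_ids;
END
$func$;

SELECT * FROM f_conversations_dyn('{1,4,6}');
```

The generated statement only depends on the number of user IDs, and the values themselves are passed with USING.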

More explanation:

Using same column multiple times in WHERE clause

Alternative: MV for sparsely written table

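If the table "conversationUsers" is mostly read-only (old conversations are unlikely to change), you might use a MATERIALIZED VIEW with pre-aggregated users in sorted arrays and create a plain btree index on that array column.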

CREATE MATERIALIZED VIEW mv_conversation_users AS
SELECT "conversationId", array_agg("userId") AS users  -- sorted array
FROM (
   SELECT "conversationId", "userId"
   FROM   "conversationUsers"
   ORDER  BY 1, 2
   ) sub
GROUP  BY 1
ORDER  BY 1;

CREATE INDEX ON mv_conversation_users (users) INCLUDE ("conversationId");

The demonstrated covering index requires Postgres 11. See:

https://dba.stackexchange.com/a/207938/3684

About sorting rows in a subquery:

How to apply ORDER BY and LIMIT in combination with an aggregate function?

In older versions, use a plain multicolumn index on (users, "conversationId"). With very long arrays, a hash index might make sense in Postgres 10 or later.
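As a sketch, the two index variants just mentioned could be created like this (index names are left to Postgres):

```sql
-- Before Postgres 11 (no INCLUDE clause): plain multicolumn btree index
CREATE INDEX ON mv_conversation_users (users, "conversationId");

-- Postgres 10 or later, for very long arrays: hash index on the array column
CREATE INDEX ON mv_conversation_users USING hash (users);
```

A hash index only supports equality comparison, which is the only operator the lookup on the users column needs.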

Then the much faster query would simply be:

SELECT "conversationId"
FROM   mv_conversation_users c
WHERE  users = '{1,4,6}'::int[];  -- sorted array!

db<>fiddle here
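The lookup only matches if the input array is sorted the same way as the stored arrays. If the caller cannot guarantee that, the input could be sorted inside the query, for example like this minimal sketch:

```sql
SELECT "conversationId"
FROM   mv_conversation_users
-- sort the unsorted input before comparing it to the stored sorted arrays
WHERE  users = ARRAY(SELECT unnest('{6,1,4}'::int[]) ORDER BY 1);
```

Duplicate user IDs in the input are not removed by this; that would be a separate step.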

You have to weigh the added costs to storage, writes and maintenance against the benefits to read performance.
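Part of that maintenance cost is refreshing the view, since a materialized view does not follow changes to the underlying table on its own. A sketch of the two refresh modes, assuming one row per "conversationId" in the view as defined above:

```sql
-- Plain refresh: recomputes the whole view and locks out readers meanwhile
REFRESH MATERIALIZED VIEW mv_conversation_users;

-- Refreshing without blocking readers requires a unique index on the view
CREATE UNIQUE INDEX ON mv_conversation_users ("conversationId");
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_conversation_users;
```

Until the next refresh, queries against the view can return stale results for conversations that changed in the meantime.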

Aside: consider legal identifiers that don't need double quotes: conversation_id instead of "conversationId" etc.:

Are PostgreSQL column names case-sensitive?
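If renaming is an option, it could be done along these lines. The new names are only suggestions, and any application code or ORM mapping that refers to the old names would have to be updated as well:

```sql
ALTER TABLE "conversationUsers" RENAME TO conversation_users;
ALTER TABLE conversation_users RENAME COLUMN "conversationId" TO conversation_id;
ALTER TABLE conversation_users RENAME COLUMN "userId" TO user_id;
```

After that, queries can reference conversation_users, conversation_id and user_id without double quotes.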
