Query to count the frequency of many-to-many associations

Question

I have two tables with a many-to-many association in PostgreSQL. The first table contains activities, each of which can be attributed to zero or more reasons:

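CREATE TABLE activity (
   id integer NOT NULL PRIMARY KEY  -- key assumed here; the foreign keys below require it
   -- other fields removed for readability
);

CREATE TABLE reason (
   id varchar(1) NOT NULL PRIMARY KEY  -- key assumed here as well
   -- other fields here
);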

To implement the association, there is a join table between the two:

CREATE TABLE activity_reason (
   activity_id integer NOT NULL, -- refers to activity.id
   reason_id varchar(1) NOT NULL, -- refers to reason.id
   CONSTRAINT activity_reason_activity FOREIGN KEY (activity_id) REFERENCES activity (id),
   CONSTRAINT activity_reason_reason FOREIGN KEY (reason_id) REFERENCES reason (id)
);

I want to count how often each combination of reasons occurs across activities. Suppose the table activity_reason holds these records:

+--------------+------------+
| activity_id  |  reason_id |
+--------------+------------+
|           1  |          A |
|           1  |          B |
|           2  |          A |
|           2  |          B |
|           3  |          A |
|           4  |          C |
|           4  |          D |
|           4  |          E |
+--------------+------------+

I should get something like this:

+-------+---+------+-------+
| count |   |      |       |
+-------+---+------+-------+
|     2 | A | B    | NULL  |
|     1 | A | NULL | NULL  |
|     1 | C | D    | E     |
+-------+---+------+-------+

Or, eventually, something like:

+-------+-------+
| count |       |
+-------+-------+
|     2 | A,B   |
|     1 | A     |
|     1 | C,D,E |
+-------+-------+

I can't find the SQL query to do this.

Answer 1

You can use string_agg():

select reasons, count(*)
from (select activity_id, string_agg(reason_id, ',' order by reason_id) as reasons
      from activity_reason
      group by activity_id
     ) a
group by reasons
order by count(*) desc;
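
To try this without creating the tables, the sample rows from the question can be supplied inline. This is only a sketch: the CTE reuses the name activity_reason so the query above runs unchanged, and the VALUES list is copied from the question.

```sql
-- Sample rows from the question, supplied inline instead of a real table
WITH activity_reason (activity_id, reason_id) AS (
   VALUES (1, 'A'), (1, 'B'),
          (2, 'A'), (2, 'B'),
          (3, 'A'),
          (4, 'C'), (4, 'D'), (4, 'E')
)
select reasons, count(*)
from (select activity_id, string_agg(reason_id, ',' order by reason_id) as reasons
      from activity_reason
      group by activity_id
     ) a
group by reasons
order by count(*) desc;
```

'A,B' comes first with a count of 2. 'A' and 'C,D,E' each have a count of 1 and can appear in either order, because the query only sorts by the count.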

Answer 2

I think you can get what you want with this query:

SELECT count(*) AS count, reasons
FROM (
  SELECT activity_id, array_agg(reason_id) AS reasons
  FROM (
    -- the key column of activity is "id", not "activity_id"
    SELECT A.id AS activity_id, AR.reason_id
    FROM activity A
    LEFT JOIN activity_reason AR ON AR.activity_id = A.id
    ORDER BY A.id, AR.reason_id
  ) AS ordered_reasons
  GROUP BY activity_id
) reason_arrays
GROUP BY reasons;

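First, you aggregate all the reasons of each activity into one array per activity. You have to order the associations first, otherwise ['a', 'b'] and ['b', 'a'] are treated as different sets and get separate counts. You also need the left join from activity, or any activity without reasons will not show up in the result set. I'm not sure whether that is desirable; I can take it back out if you don't want activities without a reason to be included. Then you count the number of activities that have the same set of reasons.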

Here is a sqlfiddle to demonstrate.

As Gordon Linoff mentioned, you could also use a string instead of an array. I'm not sure which would perform better.
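
The ordering requirement follows from how PostgreSQL compares arrays: element by element, by position. A quick check that does not depend on the tables above; it also shows what a single NULL reason_id, as produced by the left join for an activity without reasons, aggregates to. The inline values are made up for illustration.

```sql
-- Array equality is position-sensitive, so unsorted aggregates do not match
SELECT ARRAY['a','b'] = ARRAY['b','a'] AS same_when_unsorted,  -- false
       ARRAY['a','b'] = ARRAY['a','b'] AS same_when_sorted;    -- true

-- One row with a NULL reason_id becomes an array holding one NULL, not an empty array
SELECT array_agg(reason_id) AS reasons                         -- {NULL}
FROM  (VALUES (NULL::text)) t(reason_id);
```

So activities without reasons are counted together under {NULL}.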

Answer 3

We need to compare sorted lists of reasons to identify equal sets.

SELECT count(*) AS ct, reason_list
FROM  (
   SELECT array_agg(reason_id) AS reason_list
   FROM  (SELECT * FROM activity_reason ORDER BY activity_id, reason_id) ar1
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;

ORDER BY reason_id alone would also work in the innermost subquery, but adding activity_id is typically faster.

And we don't strictly need the innermost subquery at all. This works as well:

SELECT count(*) AS ct, reason_list
FROM  (
   SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
   FROM   activity_reason
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;

But it is typically slower when processing all or most of the table. Quoting the manual:

Alternatively, supplying the input values from a sorted subquery will usually work.

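We could use string_agg() instead of array_agg(), and that would work for your example with varchar(1) (which might be more efficient with data type "char", by the way). It can fail for longer strings, though, because the aggregated value can be ambiguous.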
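
A sketch of that ambiguity, using made-up multi-character reason values that are not from the question: with ',' as the delimiter, a single reason 'A,B' and the two reasons 'A' and 'B' aggregate to the same string, while arrays keep them apart.

```sql
-- Hypothetical data: the reason values here are longer than varchar(1)
WITH demo (activity_id, reason_id) AS (
   VALUES (1, 'A,B'),           -- one reason that contains the delimiter
          (2, 'A'), (2, 'B')    -- two separate reasons
)
SELECT activity_id,
       string_agg(reason_id, ',' ORDER BY reason_id) AS as_string,  -- 'A,B' for both activities
       array_agg(reason_id ORDER BY reason_id)       AS as_array    -- {"A,B"} vs. {A,B}
FROM   demo
GROUP  BY activity_id
ORDER  BY activity_id;
```

Grouping by as_string would merge the two activities into one set; grouping by as_array would not.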

If reason_id were an integer (as is typically the case), there is another, faster solution with sort() from the additional module intarray:

SELECT count(*) AS ct, reason_list
FROM  (
   SELECT sort(array_agg(reason_id)) AS reason_list
   FROM   activity_reason2
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;
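
Note that sort() is not built in; it ships with the intarray extension, which has to be installed in the database first. A minimal sketch, assuming you have the privilege to create extensions; the inline integer data and the CTE name demo are made up for illustration.

```sql
-- Install the extension once per database
CREATE EXTENSION IF NOT EXISTS intarray;

WITH demo (activity_id, reason_id) AS (
   VALUES (1, 2), (1, 1),   -- same set as activity 2, listed in a different order
          (2, 1), (2, 2),
          (3, 1)
)
SELECT count(*) AS ct, reason_list
FROM  (
   SELECT sort(array_agg(reason_id)) AS reason_list  -- sort the int[] after aggregating
   FROM   demo
   GROUP  BY activity_id
   ) sub
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;
-- {1,2} with ct = 2, then {1} with ct = 1
```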

