Query to count the frequency of many-to-many associations
I have two tables with a many-to-many association in PostgreSQL. The first table contains activities, each of which can be attributed to zero or more reasons:
CREATE TABLE activity (
id integer NOT NULL,
-- other fields removed for readability
);
CREATE TABLE reason (
id varchar(1) NOT NULL,
-- other fields here
);
To implement the association, there is a junction table between these two tables:
CREATE TABLE activity_reason (
activity_id integer NOT NULL, -- refers to activity.id
reason_id varchar(1) NOT NULL, -- refers to reason.id
CONSTRAINT activity_reason_activity FOREIGN KEY (activity_id) REFERENCES activity (id),
CONSTRAINT activity_reason_reason FOREIGN KEY (reason_id) REFERENCES reason (id)
);
I want to count the distinct combinations of reasons that are associated with activities. Suppose I have these records in table activity_reason:
+-------------+-----------+
| activity_id | reason_id |
+-------------+-----------+
|           1 | A         |
|           1 | B         |
|           2 | A         |
|           2 | B         |
|           3 | A         |
|           4 | C         |
|           4 | D         |
|           4 | E         |
+-------------+-----------+
I should get something like this:
+-------+---+------+------+
| count |   |      |      |
+-------+---+------+------+
|     2 | A | B    | NULL |
|     1 | A | NULL | NULL |
|     1 | C | D    | E    |
+-------+---+------+------+
Or, eventually, something like:
+-------+-------+
| count |       |
+-------+-------+
|     2 | A,B   |
|     1 | A     |
|     1 | C,D,E |
+-------+-------+
I can't find the SQL query to do this.
You can use string_agg():
select reasons, count(*)
from (select activity_id, string_agg(reason_id, ',' order by reason_id) as reasons
from activity_reason
group by activity_id
) a
group by reasons
order by count(*) desc;
I think you can get what you want with this query:
SELECT count(*) as count, reasons
FROM (
SELECT activity_id, array_agg(reason_id) AS reasons
FROM (
-- activity has no activity_id column; its key is id
SELECT A.id AS activity_id, AR.reason_id
FROM activity A
LEFT JOIN activity_reason AR ON AR.activity_id = A.id
ORDER BY A.id, AR.reason_id
) AS ordered_reasons
GROUP BY activity_id
) reason_arrays
GROUP BY reasons
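First, you aggregate all reasons of an activity into one array per activity. You have to sort the associations first; otherwise ['a', 'b'] and ['b', 'a'] would be treated as different sets and get separate counts. You also need the LEFT JOIN; otherwise activities without any reasons would not show up in the result set. I'm not sure whether that is desirable. I can take it out if you want activities without reasons to be excluded. Then you count the number of activities that have the same set of reasons.
Here is a sqlfiddle to demonstrate.
As Gordon Linoff mentioned, you could also use strings instead of arrays. I'm not sure which would be better for performance.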
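To make the effect of the LEFT JOIN concrete, here is a minimal, self-contained sketch with the sample data inlined as CTEs. Activity 5 is a made-up row without any reasons, and the FILTER clause (PostgreSQL 9.4 or later) is an addition of this sketch, not part of the query above: without it, array_agg() would return {NULL} rather than NULL for such an activity.

```sql
-- Sample data inlined so the query runs without creating tables.
-- Activity 5 is hypothetical: it has no row in activity_reason.
WITH activity(id) AS (
   VALUES (1), (2), (3), (4), (5)
)
, activity_reason(activity_id, reason_id) AS (
   VALUES (1, 'B'), (1, 'A'), (2, 'A'), (2, 'B')
        , (3, 'A'), (4, 'C'), (4, 'D'), (4, 'E')
)
SELECT count(*) AS ct, reasons
FROM (
   SELECT a.id
          -- Sort inside the aggregate so {A,B} and {B,A} end up identical;
          -- skip the NULL produced by the LEFT JOIN for activities without reasons.
        , array_agg(ar.reason_id ORDER BY ar.reason_id)
             FILTER (WHERE ar.reason_id IS NOT NULL) AS reasons
   FROM activity a
   LEFT JOIN activity_reason ar ON ar.activity_id = a.id
   GROUP BY a.id
) reason_arrays
GROUP BY reasons
ORDER BY ct DESC, reasons;
```

With this data, activity 5 forms a group of its own with a NULL reason list. Switching to a plain JOIN drops that group.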
We need to compare sorted lists of reasons to identify equal sets.
SELECT count(*) AS ct, reason_list
FROM (
SELECT array_agg(reason_id) AS reason_list
FROM (SELECT * FROM activity_reason ORDER BY activity_id, reason_id) ar1
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
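ORDER BY reason_id in the innermost subquery would work too, but adding activity_id is typically faster.
And we don't strictly need the innermost subquery at all. This works as well: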
SELECT count(*) AS ct, reason_list
FROM (
SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
FROM activity_reason
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
But it is typically slower when processing all or most of the table. Quoting the manual:
Alternatively, supplying the input values from a sorted subquery will usually work.
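The first output format in the question (one column per reason) can be derived from the sorted array by subscripting it. This is only a sketch: it assumes no activity has more than three reasons, since the number of output columns must be fixed in advance, and the column names reason_1 to reason_3 are made up. The sample data is inlined as a CTE.

```sql
WITH activity_reason(activity_id, reason_id) AS (
   VALUES (1, 'A'), (1, 'B'), (2, 'A'), (2, 'B')
        , (3, 'A'), (4, 'C'), (4, 'D'), (4, 'E')
)
SELECT count(*) AS ct
       -- A subscript past the end of the array yields NULL.
     , reason_list[1] AS reason_1
     , reason_list[2] AS reason_2
     , reason_list[3] AS reason_3
FROM (
   SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
   FROM activity_reason
   GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
```

For the sample data this returns the rows (2, A, B, NULL), (1, A, NULL, NULL) and (1, C, D, E).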
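We could use string_agg() instead of array_agg(), and that works for your example with varchar(1) (which might be more efficient as data type "char", by the way). It can fail for longer strings, though, because the aggregated value can be ambiguous.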
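A small sketch of that ambiguity, using made-up reason IDs that are longer than one character and happen to contain the separator: activity 1 has the single reason 'a,b', while activity 2 has the two reasons 'a' and 'b'.

```sql
WITH activity_reason(activity_id, reason_id) AS (
   VALUES (1, 'a,b')            -- one reason containing a comma
        , (2, 'a'), (2, 'b')    -- two separate reasons
)
SELECT activity_id
     , string_agg(reason_id, ',' ORDER BY reason_id) AS as_string
     , array_agg(reason_id ORDER BY reason_id)       AS as_array
FROM activity_reason
GROUP BY activity_id
ORDER BY activity_id;
```

Both activities produce the same string 'a,b', so grouping by as_string would count them as one set. The arrays stay distinct: one element for activity 1, two elements for activity 2.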
If reason_id were an integer (as it typically would be), there is another, faster solution with sort() from the additional module intarray:
SELECT count(*) AS ct, reason_list
FROM (
SELECT sort(array_agg(reason_id)) AS reason_list
FROM activity_reason2
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
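sort() is only available once intarray is installed in the current database. A minimal sketch of the setup and a quick check, assuming you have the privileges to create extensions and the module is available on the server:

```sql
-- Install once per database.
CREATE EXTENSION IF NOT EXISTS intarray;

-- sort() orders an integer array ascending, so element order no longer matters.
SELECT sort(ARRAY[2, 1, 3]) AS sorted;                     -- {1,2,3}
SELECT sort(ARRAY[2, 1]) = sort(ARRAY[1, 2]) AS same_set;  -- true
```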
Related, with more explanation:
- Compare arrays for equality, ignoring order of elements