Find spectators that have seen the same shows (match multiple rows for each)
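For an assignment, I have to write several SQL queries for a database stored on a PostgreSQL server running PostgreSQL 9.3.0. However, I find myself blocked on the last query. The database models a reservation system for an opera house. The query is about associating each spectator with the other spectators who attend the same events every time.
The model looks like this: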
Reservations table
id_res | create_date | tickets_presented | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+----------
1 | 2015-08-05 17:45:03 | | 1 | 1 | 195 | 1
2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 | 11 | 1 | 150 | 2
Spectators table
id_spectator | last_name | first_name | email | create_time | age
---------------+------------+------------+----------------------------------------+---------------------+-----
1 | gonzalez | colin | colin.gonzalez@gmail.com | 2014-03-15 14:21:30 | 22
2 | bequet | camille | bequet.camille@gmail.com | 2014-12-10 15:22:31 | 22
Shows table
id_show | name | kind | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
1 | madama butterfly | opera | 2015-09-05 | 19:30:00 | 21:30:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
2 | don giovanni | opera | 2015-09-12 | 19:30:00 | 21:45:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
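So far I have started by writing a query that fetches the id of the spectator and the date of the show he is attending. The query looks like this: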
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
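Could someone help me understand the problem better and hint me toward a solution? Thanks in advance.
So the result I'm expecting should look something like this: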
id_spectator | other_id_spectators
-------------+--------------------
1| 2,3
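Meaning that every time the spectator with id 1 went to a show, spectators 2 and 3 did too.
It sounds like you have half of the problem solved already: determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how to determine which id_spectators attended a given id_show. Once you have that, combine the two answers to get the full result.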
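As a rough sketch of how the two lookups might be combined (not necessarily what the assignment expects), a self-join on reservations pairs every spectator with everyone else holding a reservation for the same show. The aliases r1 and r2 and the output column other_id_spectator are just illustrative names:

```sql
-- One row per (spectator, show, other spectator at that show).
-- DISTINCT guards against duplicate reservations for the same show.
SELECT DISTINCT
       r1.id_spectator
     , r1.id_show
     , r2.id_spectator AS other_id_spectator
FROM   reservations r1
JOIN   reservations r2 ON  r2.id_show = r1.id_show
                       AND r2.id_spectator <> r1.id_spectator
ORDER  BY 1, 2, 3;
```

This only lists co-attendees per show. Narrowing it down to spectators who were present at every show still needs a further step, such as counting the matching shows per pair.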
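Note based on comments: I want to make clear that this answer may be of limited use, since it was written in the context of SQL Server (the tag was present at the time).
There is probably a better way, but you can do it with the STUFF function. The only drawback here is that, since your ids are integers, placing a comma between the values involves a workaround (they need to be strings). Below is the method I can think of.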
SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]
This will show you all the other spectators that attended the same show.
So the final answer I arrived at looks like this:
SELECT id_spectator, id_show,(
SELECT string_agg(to_char(A.id_spectator, '999'), ',')
FROM Reservations A
WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
1 | 1 | 1, 2, 9
1 | 14 | 1, 2
This suits my needs, but if you have any improvements, please share :) Thanks again, everyone!
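One possible refinement, offered as a sketch rather than a definitive fix: the query above also lists the spectator himself, and to_char(..., '999') pads each id with blanks. Assuming (id_spectator, id_show) is unique in Reservations, a plain cast to text plus a filter on the spectator id might give a cleaner list:

```sql
SELECT b.id_spectator, b.id_show
     , (SELECT string_agg(a.id_spectator::text, ',' ORDER BY a.id_spectator)
        FROM   reservations a
        WHERE  a.id_show = b.id_show
        AND    a.id_spectator <> b.id_spectator  -- leave out the spectator himself
       ) AS other_id_spectators
FROM   reservations b
GROUP  BY b.id_spectator, b.id_show
ORDER  BY b.id_spectator, b.id_show;
```

Note that other_id_spectators comes out as NULL for a show nobody else attended.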
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
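In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here: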
- How to filter SQL results in a has-many-through relation
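It is special because the list of shows each spectator has to have attended is determined dynamically by the given prime spectator.
Assuming that (id_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in queries 2 and 3 below, also create an index with leading id_show.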
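A minimal sketch of what that setup might look like. The constraint and index names are made up for illustration, and appending id_spectator to the second index is an extra assumption, not something stated above:

```sql
-- UNIQUE constraint; its implicit index has id_spectator as leading column.
-- Fails if duplicate (id_spectator, id_show) pairs already exist.
ALTER TABLE reservations
   ADD CONSTRAINT reservations_spectator_show_uni UNIQUE (id_spectator, id_show);

-- Additional index with id_show as leading column (hypothetical name).
CREATE INDEX reservations_show_spectator_idx
   ON reservations (id_show, id_spectator);
```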
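1. Brute force
The primitive approach would be to form a sorted array of the shows the given user has seen and compare it with the same array for every other user: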
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM (
SELECT id_spectator
FROM reservations r
WHERE id_spectator <> 1
GROUP BY 1
HAVING array_agg(id_show ORDER BY id_show)
@> (SELECT array_agg(id_show ORDER BY id_show)
FROM reservations
WHERE id_spectator = 1)
) sub;
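But this is potentially very expensive for big tables. The whole table has to be processed, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine the relevant shows, then only consider those: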
WITH shows AS ( -- all shows of id 1; 1 row per show
SELECT id_spectator, id_show
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
)
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM (
SELECT s.id_spectator, r.id_spectator AS other
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> s.id_spectator
GROUP BY 1,2
HAVING count(*) = (SELECT count(*) FROM shows)
) sub
GROUP BY 1;
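@> (used in query 1) is the "contains" operator for arrays, so we get all spectators that have seen at least the same shows.
Faster than 1. because only relevant shows are considered.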
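To illustrate how the array "contains" operator behaves, here is a small standalone sketch with literal arrays (the column aliases are arbitrary). It also suggests that the ORDER BY inside array_agg() in query 1 is probably not strictly required, since @> looks at elements only:

```sql
SELECT ARRAY[1,2,3] @> ARRAY[1,2]   AS has_at_least_same  -- true
     , ARRAY[1,2]   @> ARRAY[1,2,3] AS has_fewer          -- false
     , ARRAY[2,1]   @> ARRAY[1,2,2] AS order_and_dupes;   -- true: order and duplicates are ignored
```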
3. Real smart
To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:
WITH RECURSIVE shows AS ( -- produces exactly 1 row
SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
GROUP BY 1
)
, cte AS (
SELECT r.id_spectator, 1 AS idx
FROM shows s
JOIN reservations r ON r.id_show = s.shows[1]
WHERE r.id_spectator <> s.id_spectator
UNION ALL
SELECT r.id_spectator, idx + 1
FROM cte c
JOIN reservations r USING (id_spectator)
JOIN shows s ON s.shows[c.idx + 1] = r.id_show
)
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM shows s
JOIN cte c ON c.idx = s.ct -- has an entry for every show
GROUP BY 1;
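Note that the first CTE is non-recursive. Only the second part is recursive (iterative, really).
This should be fastest for small selections from big tables. Rows that don't qualify are excluded early. The two indices I mentioned are essential.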
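To try the three queries above locally, a throwaway data set along these lines might do. It is a sketch with invented rows that keeps only the columns the queries touch. Note that a temporary table named reservations hides a regular table of the same name for the rest of the session:

```sql
CREATE TEMP TABLE reservations (
   id_res       serial PRIMARY KEY
 , id_show      int NOT NULL
 , id_spectator int NOT NULL
 , UNIQUE (id_spectator, id_show)
);
CREATE INDEX ON reservations (id_show);

INSERT INTO reservations (id_show, id_spectator) VALUES
   (1, 1), (14, 1)           -- prime spectator 1 saw shows 1 and 14
 , (1, 2), (14, 2), (7, 2)   -- spectator 2 saw both, plus one more
 , (1, 3), (14, 3)           -- spectator 3 saw exactly the same shows
 , (1, 9);                   -- spectator 9 saw only one of them
```

With these rows, each of the three queries should report spectators 2 and 3 for spectator 1, while 9 drops out. The order of elements in the resulting array is not guaranteed.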
SQL Fiddle demonstrating all three.