Find spectators that have seen the same shows (match multiple rows for each)

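For an assignment I have to write several SQL queries for a database stored on a server running PostgreSQL 9.3.0. However, I am stuck on the last query. The database models a reservation system for an opera house. The query is about associating each spectator with the other spectators who attended the same events every time.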

The model looks like this:

Reservations table 
id_res |     create_date     |  tickets_presented  | id_show | id_spectator | price | category 
-------+---------------------+---------------------+---------+--------------+-------+----------
     1 | 2015-08-05 17:45:03 |                     |       1 |            1 |   195 |        1
     2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 |      11 |            1 |   150 |        2

Spectators table

id_spectator   | last_name  | first_name |                email                   |     create_time     | age 
---------------+------------+------------+----------------------------------------+---------------------+-----   
             1 | gonzalez   | colin      | colin.gonzalez@gmail.com               | 2014-03-15 14:21:30 |  22
             2 | bequet     | camille    | bequet.camille@gmail.com               | 2014-12-10 15:22:31 |  22

Shows table
 id_show |          name          |  kind  | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3 
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
       1 | madama butterfly       | opera  | 2015-09-05        | 19:30:00   | 21:30:00 |         2 |           315 |           630 |           945 |        195 |        150 |        100
       2 | don giovanni           | opera  | 2015-09-12        | 19:30:00   | 21:45:00 |         2 |           315 |           630 |           945 |        195 |        150 |        100

So far I have started by writing a query that gets the id of the spectator and the date of the show he is attending. The query looks like this.

SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;

Could someone help me understand the problem better and give me a hint towards a solution? Thanks in advance.

So the result I am expecting should look something like this

id_spectator | other_id_spectators
-------------+--------------------
            1|                 2,3

Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.

It sounds like you have half of the problem solved already: determining which id_show values a particular id_spectator attended.

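What you want to ask yourself is how to determine which id_spectator values attended a show, given its id_show. Once you have that, combine the two answers to get the full result.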
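As a rough sketch of the second half, a self-join on reservations pairs each reservation with the other spectators booked for the same show. The aliases r1 and r2 and the output column other_id_spectator are illustrative names, not part of the original schema, and this only lists co-attendees per show; it does not yet require that they share every show.

```sql
-- For each reservation, list the other spectators booked for the same show.
SELECT r1.id_spectator,
       r1.id_show,
       r2.id_spectator AS other_id_spectator   -- illustrative column name
FROM   reservations r1
JOIN   reservations r2 ON  r2.id_show = r1.id_show
                       AND r2.id_spectator <> r1.id_spectator  -- skip the spectator itself
ORDER  BY r1.id_spectator, r1.id_show, r2.id_spectator;
```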

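Note based on the comments: I want to make clear that this answer may be of limited use, because it was written in the context of SQL Server (the tag was present at the time).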

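There is possibly a better way, but you can do it with the STUFF function. The only drawback is that, since your ids are integers, placing a comma between the values requires a workaround (they need to be strings). Below is the method I could come up with.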

SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]

This will show you all the other spectators that attended the same show.

So the final answer I arrived at looks like this:

SELECT id_spectator, id_show,(
    SELECT string_agg(to_char(A.id_spectator, '999'), ',')
    FROM Reservations A
    WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;

Which prints out something like this:

id_spectator | id_show | other_id_spectators 
-------------+---------+---------------------
           1 |       1 |    1,   2,   9
           1 |      14 |    1,   2

This suits my needs, but if you have any improvements, please share :) Thanks again, everybody!
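One possible refinement, offered only as a sketch: the output above includes the spectator itself in each list and carries padding spaces from to_char. Casting to text and filtering out the outer spectator avoids both. The ORDER BY inside string_agg is an addition for stable output, and the query assumes each (id_spectator, id_show) pair appears only once in Reservations.

```sql
SELECT b.id_spectator, b.id_show, (
    -- cast instead of to_char() so no padding spaces appear
    SELECT string_agg(a.id_spectator::text, ',' ORDER BY a.id_spectator)
    FROM   Reservations a
    WHERE  a.id_show = b.id_show
    AND    a.id_spectator <> b.id_spectator   -- leave out the spectator itself
) AS other_id_spectators
FROM   Reservations b
GROUP  BY b.id_spectator, b.id_show
ORDER  BY b.id_spectator, b.id_show;
```

A spectator who was alone at a show gets NULL in other_id_spectators, because string_agg over zero rows returns NULL.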

Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.

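In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)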

This is a special case of relational division. We have assembled an arsenal of basic techniques here:

  • How to filter SQL results in a has-many-through relation

It is special because the list of shows each spectator has to have attended is determined dynamically by the given prime spectator.

Assuming that (id_spectator, id_show) is unique in reservations, which has not been clarified.

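A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in queries 2 and 3 below, also create an index with leading id_show.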
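A minimal sketch of that setup follows. The constraint and index names are made up for illustration, and adding id_spectator as the second index column is an assumption; the only stated requirement is that id_show leads.

```sql
-- UNIQUE constraint on (id_spectator, id_show); PostgreSQL backs it with an index.
ALTER TABLE reservations
   ADD CONSTRAINT reservations_spectator_show_uni   -- hypothetical name
   UNIQUE (id_spectator, id_show);

-- Second index with id_show as the leading column, for lookups by show.
CREATE INDEX reservations_show_spectator_idx        -- hypothetical name
   ON reservations (id_show, id_spectator);
```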

1. Brute force

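The primitive approach would be to form a sorted array of the shows the given spectator has seen and compare it with the same array for every other spectator: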

SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM  (
   SELECT id_spectator
   FROM   reservations r
   WHERE  id_spectator <> 1
   GROUP  BY 1
   HAVING        array_agg(id_show ORDER BY id_show)
      @> (SELECT array_agg(id_show ORDER BY id_show)
          FROM   reservations
          WHERE  id_spectator = 1)
   ) sub;

But this is potentially very expensive for big tables. The whole table has to be processed, and in a rather expensive way, too.

2. Smarter

Use a CTE to determine the relevant shows, then only consider those:

WITH shows AS (             -- all shows of id 1; 1 row per show
   SELECT id_spectator, id_show
   FROM   reservations
   WHERE  id_spectator = 1  -- your prime spectator here
   )
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM  (
   SELECT s.id_spectator, r.id_spectator AS other
   FROM   shows s
   JOIN   reservations r USING (id_show)
   WHERE  r.id_spectator <> s.id_spectator
   GROUP  BY 1,2
   HAVING count(*) = (SELECT count(*) FROM shows)
   ) sub
GROUP  BY 1;

In query 1, @> is the "contains" operator for arrays, so we get all spectators that have seen at least the same shows.
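As a quick illustration of the array "contains" operator with literal arrays (the values are arbitrary examples, not data from the question):

```sql
SELECT ARRAY[1,2,3] @> ARRAY[1,2]   AS has_all_shows,  -- true: left side holds every element of the right
       ARRAY[1,2]   @> ARRAY[1,2,3] AS missing_a_show; -- false: element 3 is not in the left array
```

Element order does not affect the result of @>.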

Query 2 is faster than query 1, because only relevant shows are considered.

3. Real smart

To also exclude early any spectators that are not going to qualify, use a recursive CTE:

WITH RECURSIVE shows AS (   -- produces exactly 1 row
   SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
   FROM   reservations
   WHERE  id_spectator = 1  -- your prime spectator here
   GROUP  BY 1
   )
, cte AS (
   SELECT r.id_spectator, 1 AS idx
   FROM   shows s
   JOIN   reservations r ON r.id_show = s.shows[1]
   WHERE  r.id_spectator <> s.id_spectator

   UNION  ALL
   SELECT r.id_spectator, idx + 1
   FROM   cte c
   JOIN   reservations r USING (id_spectator)
   JOIN   shows s ON s.shows[c.idx + 1] = r.id_show
   )
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM   shows s
JOIN   cte c ON c.idx = s.ct  -- has an entry for every show
GROUP  BY 1;

Note that the first CTE is non-recursive. Only the second part is recursive (iterative, really).

This should be fastest for small selections from big tables. Rows that do not qualify are excluded early. The two indexes I mentioned are essential.

SQL Fiddle demonstrating all three.