Find spectators that have seen the same shows (match multiple rows for each)

Question

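For an assignment, I have to write several SQL queries for a database stored on a server running PostgreSQL 9.3.0. However, I am stuck on the last query. The database models a reservation system for an opera house. The query is about associating each spectator with the other spectators who attended the same shows every time.

The model looks like this: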

Reservations table 
id_res |     create_date     |  tickets_presented  | id_show | id_spectator | price | category 
-------+---------------------+---------------------+---------+--------------+-------+----------
     1 | 2015-08-05 17:45:03 |                     |       1 |            1 |   195 |        1
     2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 |      11 |            1 |   150 |        2

Spectators table

id_spectator   | last_name  | first_name |                email                   |     create_time     | age 
---------------+------------+------------+----------------------------------------+---------------------+-----   
             1 | gonzalez   | colin      | colin.gonzalez@gmail.com               | 2014-03-15 14:21:30 |  22
             2 | bequet     | camille    | bequet.camille@gmail.com               | 2014-12-10 15:22:31 |  22

Shows table
 id_show |          name          |  kind  | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3 
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
       1 | madama butterfly       | opera  | 2015-09-05        | 19:30:00   | 21:30:00 |         2 |           315 |           630 |           945 |        195 |        150 |        100
       2 | don giovanni           | opera  | 2015-09-12        | 19:30:00   | 21:45:00 |         2 |           315 |           630 |           945 |        195 |        150 |        100

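So far, I have started by writing a query that returns the id of each spectator and the date of the show he is attending. The query looks like this: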

SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;

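Could someone help me understand the problem better and give me a hint toward a solution? Thanks in advance.

The result I'm expecting should look something like this: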

id_spectator | other_id_spectators
-------------+--------------------
            1|                 2,3

Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.

Answer 1

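It sounds like you already have half of the problem solved: determining which id_show values a particular id_spectator attended.

What you want to ask yourself is how to determine which id_spectator values attended a given id_show. Once you have that, combine the two answers to get the full result.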
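
As a rough sketch of the two halves and one possible way to combine them (the literal ids 1 are only placeholders, and the self-join is an assumption about how to combine them, not the only option):

```sql
-- Half 1: the shows a given spectator attended (spectator 1 as a placeholder).
SELECT id_show
FROM   reservations
WHERE  id_spectator = 1;

-- Half 2: the spectators who attended a given show (show 1 as a placeholder).
SELECT id_spectator
FROM   reservations
WHERE  id_show = 1;

-- Combined with a self-join: everyone who shared at least one show with spectator 1.
SELECT DISTINCT r2.id_spectator
FROM   reservations r1
JOIN   reservations r2 ON r2.id_show = r1.id_show
WHERE  r1.id_spectator = 1
AND    r2.id_spectator <> r1.id_spectator;
```

Note that the combined query only finds spectators who shared at least one show. Requiring that they attended every show of spectator 1 still needs an extra counting or comparison step.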

Answer 2

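Note based on comments: I want to make clear that this answer may be of limited use, since it was written for SQL Server (the tag was present at the time).

There may be a better way, but you can do it with the STUFF function. The only drawback is that, since your ids are integers, placing commas between the values requires a workaround (the values need to be strings). Below is the workaround I could come up with.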

SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]

This will show you all other spectators that attended the same show.

Answer 3

So the final answer I arrived at looks like this:

SELECT id_spectator, id_show,(
    SELECT string_agg(to_char(A.id_spectator, '999'), ',')
    FROM Reservations A
    WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;

It prints something like this:

id_spectator | id_show | other_id_spectators 
-------------+---------+---------------------
           1 |       1 |    1,   2,   9
           1 |      14 |    1,   2

This suits my needs, but if you have any improvements, please share :) Thanks again, everyone!
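
One possible refinement, offered only as a sketch: the padding in the output comes from to_char(..., '999'), and the list includes the spectator itself. Casting to text and filtering out the outer spectator in the subquery would address both, assuming that is the desired result:

```sql
SELECT B.id_spectator, B.id_show, (
    -- ::text avoids the leading blanks that to_char(..., '999') produces
    SELECT string_agg(A.id_spectator::text, ',' ORDER BY A.id_spectator)
    FROM   Reservations A
    WHERE  A.id_show = B.id_show
    AND    A.id_spectator <> B.id_spectator   -- leave out the spectator itself
) AS other_id_spectators
FROM   Reservations B
GROUP  BY B.id_spectator, B.id_show
ORDER  BY B.id_spectator, B.id_show;
```

If a spectator was alone at a show, other_id_spectators is NULL for that row. Like the query above, this still lists co-attendees per show rather than spectators who attended every show together.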

Answer 4

Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.

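In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one).

This is a special case of relational division. We have assembled an arsenal of basic techniques here: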

How to filter SQL results in a has-many-through relation

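It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.

This assumes that (id_spectator, id_show) is unique in reservations, which has not been clarified.

A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in queries 2 and 3 below, also create an index with leading id_show.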
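
A minimal sketch of what those two definitions could look like. The constraint and index names are made up for illustration, and adding id_spectator as the second index column is an assumption; the text above only asks for id_show as the leading column:

```sql
-- UNIQUE constraint; also creates an index on (id_spectator, id_show).
ALTER TABLE reservations
   ADD CONSTRAINT reservations_spectator_show_uni  -- hypothetical name
   UNIQUE (id_spectator, id_show);

-- Second index with id_show as the leading column.
CREATE INDEX reservations_show_spectator_idx       -- hypothetical name
   ON reservations (id_show, id_spectator);
```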

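1. Brute force

The primitive approach would be to form a sorted array of the shows the given user has seen and compare it with the same array for every other user: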

SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM  (
   SELECT id_spectator
   FROM   reservations r
   WHERE  id_spectator <> 1
   GROUP  BY 1
   HAVING        array_agg(id_show ORDER BY id_show)
      @> (SELECT array_agg(id_show ORDER BY id_show)
          FROM   reservations
          WHERE  id_spectator = 1)
   ) sub;

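But this is potentially very expensive for big tables. The whole table has to be processed, and in a rather expensive way, too.

2. Smarter

Use a CTE to determine the relevant shows, then only consider those: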

WITH shows AS (             -- all shows of id 1; 1 row per show
   SELECT id_spectator, id_show
   FROM   reservations
   WHERE  id_spectator = 1  -- your prime spectator here
   )
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM  (
   SELECT s.id_spectator, r.id_spectator AS other
   FROM   shows s
   JOIN   reservations r USING (id_show)
   WHERE  r.id_spectator <> s.id_spectator
   GROUP  BY 1,2
   HAVING count(*) = (SELECT count(*) FROM shows)
   ) sub
GROUP  BY 1;

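In query 1, @> is the "contains" operator for arrays, so we get all spectators that have seen at least the same shows.

Query 2 is faster than query 1 because only relevant shows are considered.

3. Real smart

To also exclude spectators that are not going to qualify early from the query, use a recursive CTE: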
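
As a quick illustration of the array "contains" operator on literal arrays (the values are arbitrary):

```sql
SELECT ARRAY[1,2,3] @> ARRAY[1,3] AS contains_all   -- true: 1 and 3 are both in the left array
     , ARRAY[1,2]   @> ARRAY[1,3] AS missing_one;   -- false: 3 is missing from the left array
```

Element order and duplicates do not matter to @>, so a spectator who has seen additional shows still qualifies.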

WITH RECURSIVE shows AS (   -- produces exactly 1 row
   SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
   FROM   reservations
   WHERE  id_spectator = 1  -- your prime spectator here
   GROUP  BY 1
   )
, cte AS (
   SELECT r.id_spectator, 1 AS idx
   FROM   shows s
   JOIN   reservations r ON r.id_show = s.shows[1]
   WHERE  r.id_spectator <> s.id_spectator

   UNION  ALL
   SELECT r.id_spectator, idx + 1
   FROM   cte c
   JOIN   reservations r USING (id_spectator)
   JOIN   shows s ON s.shows[c.idx + 1] = r.id_show
   )
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM   shows s
JOIN   cte c ON c.idx = s.ct  -- has an entry for every show
GROUP  BY 1;

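Note that the first CTE is non-recursive. Only the second part is recursive (iterative, really).

This should be fastest for small selections from big tables. Rows that don't qualify are excluded early. The two indexes I mentioned are essential.

SQL Fiddle demonstrating all three.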
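
To try the three queries locally, a small data set along these lines may help. The rows are invented for illustration, and the temporary table shadows any existing reservations table for the current session only:

```sql
-- Hypothetical test data; only the two columns the queries use.
CREATE TEMP TABLE reservations (
   id_spectator int NOT NULL
 , id_show      int NOT NULL
 , UNIQUE (id_spectator, id_show)
);

CREATE INDEX reservations_tmp_show_idx ON reservations (id_show, id_spectator);

INSERT INTO reservations (id_spectator, id_show) VALUES
   (1, 1), (1, 14)          -- prime spectator: shows 1 and 14
 , (2, 1), (2, 14), (2, 5)  -- saw both shows and one more: qualifies
 , (9, 1)                   -- saw only show 1: does not qualify
 , (3, 14);                 -- saw only show 14: does not qualify
```

With this data, spectator 2 is the only one who attended every show of spectator 1, so each of the three queries should return spectator 2 alone as the other spectator.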


