How to get aggregate data from a dynamic number of related rows in adjacent table
I have a table of matches played, roughly like this:
player_id | match_id | result | opponent_rank
----------------------------------------------
82 | 2847 | w | 42
82 | 3733 | w | 185
82 | 4348 | l | 10
82 | 5237 | w | 732
82 | 5363 | w | 83
82 | 7274 | w | 6
51 | 2347 | w | 39
51 | 3746 | w | 394
51 | 5037 | l | 90
... | ... | ... | ...
To get all win streaks (not just each player's longest streak), I use this query:
SELECT player.tag, s.streak, match.date, s.player_id, s.match_id FROM (
  SELECT streaks.streak, streaks.player_id, streaks.match_id FROM (
    SELECT w1.player_id, max(w1.match_id) AS match_id, count(*) AS streak FROM (
      SELECT w2.player_id, w2.match_id, w2.win, w2.date, sum(w2.grp) OVER w AS grp FROM (
        SELECT m.player_id, m.match_id, m.win, m.date, (m.win = false AND LAG(m.win, 1, true) OVER w = true)::integer AS grp FROM matches_m AS m
        WHERE m.opponent_position<'100'
        WINDOW w AS (PARTITION BY m.player_id ORDER BY m.date, m.match_id)
      ) AS w2
      WINDOW w AS (PARTITION BY w2.player_id ORDER BY w2.date, w2.match_id)
    ) AS w1
    WHERE w1.win = true
    GROUP BY w1.player_id, w1.grp
    ORDER BY w1.player_id DESC, count(*) DESC
  ) AS streaks
  ORDER BY streaks.streak DESC
  LIMIT 100
) AS s
LEFT JOIN player ON player.id = s.player_id
LEFT JOIN match ON match.id = s.match_id
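The result looks like this (note that this is not a fixed table/view, since the query above can be extended with parameters such as nationality, date range, player ranking, etc.):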
player_id | match_id | streak
-------------------------------
82 | 3733 | 2
82 | 7274 | 3
51 | 3746 | 2
... | ... | ...
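What I now want to add is a set of aggregate data that gives more detail about each win streak. For starters, I'd like to know the average rank of the opponents during each streak. Other data would be the duration of the streak, the first and last date, the name of the opponent who ended the streak or whether it is still ongoing, and so on. I have tried various approaches: CTEs, some elaborate joins, unions, or adding them as lag functions to the existing code. But I'm completely stuck on how to solve this.
As the code shows, my SQL skills are basic, so please excuse any mistakes or inefficient statements. For full context, I'm using Postgres 9.4 on Debian, and the matches_m table is a materialized view with 550k rows (the query currently takes 2.5 seconds). The data comes from http://aligulac.com/about/db/; I simply mirror it to create the view above.
You need to fetch all the rows of the top streaks instead of the aggregated rows.
This returns the top 100 streaks with their detail rows (changing it to return all streaks longer than n would be easier).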
SELECT ....
FROM
 (
   SELECT streaks.*,
      -- used to filter the top 100 streaks
      -- (it would be more efficient to drop this and filter streaks only in the WHERE below)
      Dense_Rank()
      Over (ORDER BY streak DESC, grp, player_id) AS topStreak
   FROM
    (
      SELECT w1.*,
         Count(*)
         Over (PARTITION BY player_id, grp) AS streak -- count wins per streak
      FROM
       ( -- simplified: the group numbers are assigned with a single cumulative sum
         SELECT m.player_id, m.match_id, m.win, m.DATE, -- add further columns as needed
            -- cumulative sum over 0/1; it doesn't increase for wins, so a streak of wins gets the same number
            Sum(CASE WHEN win = False THEN 1 ELSE 0 end)
            Over(PARTITION BY m.player_id
                 ORDER BY DATE, match_id
                 ROWS Unbounded Preceding) AS grp
         FROM matches_m AS m
         WHERE m.opponent_position<'100' -- should be <100 if it's an INT
       ) AS w1
      WHERE w1.win = True -- remove the losses
    ) AS streaks
   -- restrict the number of rows processed by the DENSE_RANK
   -- (could be used instead of DENSE_RANK + WHERE topStreak <= 100)
   WHERE streak > 20
 ) AS s
WHERE topStreak <= 100
Now you can apply any kind of analysis to these streaks. As PG is not my main DBMS, I don't know whether it's easier to use arrays or window functions such as last_value(opponent_player_id) over ...
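As a rough illustration of the window-function route, the sketch below keeps one row per won match and attaches per-streak figures to each row. It assumes matches_m exposes win, date and opponent_rank (the sample table calls the rank column opponent_rank while the question's query uses opponent_position, so adjust the name as needed); the window name streak_w and the output column names are made up for this example.

```sql
-- Sketch only: column names win, date and opponent_rank are assumptions.
SELECT w1.*,
       count(*)              OVER streak_w AS streak,            -- wins in this streak
       avg(w1.opponent_rank) OVER streak_w AS avg_opponent_rank, -- average opponent rank
       min(w1.date)          OVER streak_w AS first_date,        -- first win of the streak
       max(w1.date)          OVER streak_w AS last_date          -- last win of the streak
FROM (
    SELECT m.player_id, m.match_id, m.win, m.date, m.opponent_rank,
           -- running count of losses: constant across consecutive wins
           sum(CASE WHEN m.win = false THEN 1 ELSE 0 END)
               OVER (PARTITION BY m.player_id
                     ORDER BY m.date, m.match_id
                     ROWS UNBOUNDED PRECEDING) AS grp
    FROM matches_m AS m
) AS w1
WHERE w1.win = true  -- keep only the wins
WINDOW streak_w AS (PARTITION BY w1.player_id, w1.grp);
```

Because streak_w has no ORDER BY, every aggregate sees the whole streak, so each detail row carries the same per-streak values.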
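I think this is what you want.
The key idea is to assign a "streak group" to each win streak so that you can aggregate over it. You can do this by observing that:
- The matches in a win streak are obviously wins.
- A win streak can be identified by counting the number of preceding losses; this count is constant within a streak.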
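To make the two observations concrete, here is a small self-contained sketch that runs the loss count over the sample rows from the question. The sample has no date column, so ordering by match_id is an assumption made only for this illustration.

```sql
-- Sample rows from the question; ordering by match_id is assumed.
SELECT t.player_id, t.match_id, t.result,
       -- number of losses up to and including this row
       sum(CASE WHEN t.result = 'l' THEN 1 ELSE 0 END)
           OVER (PARTITION BY t.player_id ORDER BY t.match_id) AS streak_grp
FROM (VALUES
        (82, 2847, 'w', 42),
        (82, 3733, 'w', 185),
        (82, 4348, 'l', 10),
        (82, 5237, 'w', 732),
        (82, 5363, 'w', 83),
        (82, 7274, 'w', 6),
        (51, 2347, 'w', 39),
        (51, 3746, 'w', 394),
        (51, 5037, 'l', 90)
     ) AS t (player_id, match_id, result, opponent_rank)
ORDER BY t.player_id, t.match_id;
```

For player 82 the first two wins get streak_grp 0, and the loss plus the three wins after it get streak_grp 1. Dropping the loss rows and grouping by (player_id, streak_grp) therefore yields streaks of 2 and 3, matching the expected result in the question.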
Postgres introduced the filter clause in 9.4, which makes the syntax a bit simpler:
select player_id, count(*) as streak_length,
       avg(opponent_rank) as avg_opponent_rank
from (select m.*,
             -- match_id breaks ties between matches played on the same date
             count(*) filter (where result = 'l') over (partition by player_id order by date, match_id) as streak_grp
      from matches_m m
     ) m
where result = 'w'
group by player_id, streak_grp;
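The same grouping can carry the other figures the question asks for. The following is only a sketch under assumptions: it supposes matches_m has result, date, match_id and opponent_rank columns, that match_id grows over time (as the question's own max(match_id) implies), and the names losses_after, ongoing and last_match_id are invented here.

```sql
SELECT s.player_id,
       count(*)                  AS streak_length,
       avg(s.opponent_rank)      AS avg_opponent_rank,
       min(s.date)               AS first_date,
       max(s.date)               AS last_date,
       max(s.date) - min(s.date) AS duration,      -- days for DATE, interval for TIMESTAMP
       max(s.match_id)           AS last_match_id, -- assumes match_id increases over time
       bool_and(s.losses_after = 0) AS ongoing     -- true if no loss follows the streak
FROM (
    SELECT m.*,
           -- losses so far: identifies the streak
           count(*) FILTER (WHERE m.result = 'l')
               OVER (PARTITION BY m.player_id
                     ORDER BY m.date, m.match_id) AS streak_grp,
           -- losses still to come: zero means the streak has not been ended
           count(*) FILTER (WHERE m.result = 'l')
               OVER (PARTITION BY m.player_id
                     ORDER BY m.date, m.match_id
                     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS losses_after
    FROM matches_m AS m
) AS s
WHERE s.result = 'w'
GROUP BY s.player_id, s.streak_grp
ORDER BY streak_length DESC
LIMIT 100;
```

Joining last_match_id back to the player and match tables, as the question's query does, would then supply the tag and date columns. If the inner query is filtered (for example by opponent rank), ongoing refers to the filtered set of matches only.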