Take batches of records for each rank, then JOIN, then LIMIT 1 in Postgres
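I am trying to improve the performance of a query. From EXPLAIN ANALYZE I learned that the query considers far more songs records than I think is necessary.
There are three tables: artists(artist_id, score), songs(song_id, artist_id), and listened(song_id).
My current query looks like this: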
WITH artists_ranked AS (
    SELECT
        artist_id
        , rank() OVER (ORDER BY score) rnk
    FROM artists
    ORDER BY rnk ASC
),
not_listened_songs AS (
    SELECT *
    FROM songs
    WHERE NOT EXISTS (
        SELECT 1
        FROM listened
        WHERE listened.song_id = songs.song_id) -- bad: I go through all songs
),
shuffled_songs AS (
    SELECT *
    FROM artists_ranked
    JOIN not_listened_songs ON not_listened_songs.artist_id = artists_ranked.artist_id
    ORDER BY random() -- bad: I shuffle all songs
)
SELECT DISTINCT ON (artist_id) *
FROM shuffled_songs
LIMIT 1;
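Ideally (at least as I see it), the query should follow these steps:
- Rank the artists table by score.
- Take a batch of top-ranked artists. This can be one artist or several.
- Join the songs table, excluding songs that are already in listened.
- Pick one song at random, giving every artist an equal chance: ORDER BY random(), DISTINCT ON (artist_id), LIMIT 1.
- If such a song exists, stop and return it. Otherwise, take the next batch of artists (the nearest lower rank) and repeat the steps.
- Stop either when a song has been returned (most likely after a few iterations) or when all artists have been considered.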
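The steps above can be sketched as a plain loop, which may help when comparing them with a set-based query. This is only an illustration in Python, not SQL: the function name `pick_song` and the in-memory dictionaries are hypothetical, and a batch is assumed to be all artists sharing one score, visited in ascending score order as in `rank() OVER (ORDER BY score)`.

```python
import random
from itertools import groupby


def pick_song(artists, songs_by_artist, listened, rng=random):
    """Return (artist_id, song_id), or None when nothing is left to play.

    artists:         dict artist_id -> score        (hypothetical layout)
    songs_by_artist: dict artist_id -> list of song_id
    listened:        set of song_id
    """
    # Rank artists by score; groupby then yields one batch per distinct score.
    ranked = sorted(artists, key=artists.get)
    for _, batch in groupby(ranked, key=artists.get):
        eligible = {}
        for artist_id in batch:
            # Only the songs of the current batch are inspected.
            fresh = [s for s in songs_by_artist.get(artist_id, [])
                     if s not in listened]
            if fresh:
                eligible[artist_id] = fresh
        if eligible:
            # Equal chance per artist first, then a random song of that artist.
            artist_id = rng.choice(sorted(eligible))
            return artist_id, rng.choice(eligible[artist_id])
    return None  # all artists were considered
```

The loop stops at the first batch that still has an unlistened song, so lower-ranked artists and their songs are never touched unless needed.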
Thank you.
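Think about the problem in terms of relational algebra rather than loops.
To get the songs that have not been played yet, join artists to songs where song_id does not exist in listened. Sort by score descending so that songs from the top-rated artists come first, then shuffle randomly within each score. Limit the result to 1 record.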
SELECT song_id
FROM artists a
JOIN songs s ON s.artist_id = a.artist_id
WHERE NOT EXISTS (SELECT TRUE FROM listened l WHERE l.song_id = s.song_id)
ORDER BY score DESC, RANDOM()
LIMIT 1
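A quick way to sanity-check this query is to run it against toy data. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Postgres, purely so that the example is self-contained; the table contents and the helper name `next_song` are invented, and `SELECT TRUE` is written as `SELECT 1` for portability.

```python
import sqlite3

QUERY = """
SELECT song_id
FROM artists a
JOIN songs s ON s.artist_id = a.artist_id
WHERE NOT EXISTS (SELECT 1 FROM listened l WHERE l.song_id = s.song_id)
ORDER BY score DESC, RANDOM()
LIMIT 1
"""


def next_song(conn):
    """Run the query and return a song_id, or None if everything was played."""
    row = conn.execute(QUERY).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artists  (artist_id INTEGER PRIMARY KEY, score INTEGER);
CREATE TABLE songs    (song_id INTEGER PRIMARY KEY, artist_id INTEGER);
CREATE TABLE listened (song_id INTEGER PRIMARY KEY);
INSERT INTO artists  VALUES (1, 90), (2, 50);
INSERT INTO songs    VALUES (11, 1), (12, 1), (21, 2);
INSERT INTO listened VALUES (11);
""")

# Artist 1 has the higher score and song 11 is already listened to,
# so the only candidate from the top artist is song 12.
print(next_song(conn))
```

Once song 12 is also recorded in listened, the same query falls through to artist 2 without any explicit loop.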
Can we give an equal chance to each top-score artist, rather than to each song? Artists can have different numbers of songs. If there are 2 artists with the top score and one has 100 songs while the other has 1, then the probability of picking a song from the second artist is about 0.01 (1/101), but it should be 0.5.
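The arithmetic behind this objection can be checked directly. The snippet below is a small illustration with a hypothetical helper, `artist_probabilities`, that computes each artist's chance when a single song is drawn uniformly from all candidate songs, which is what `ORDER BY score DESC, RANDOM() LIMIT 1` does within one score.

```python
from fractions import Fraction


def artist_probabilities(song_counts):
    """Chance of each artist being picked when one song is drawn uniformly."""
    total = sum(song_counts.values())
    return {artist: Fraction(n, total) for artist, n in song_counts.items()}


# Two artists share the top score: A has 100 unlistened songs, B has 1.
probs = artist_probabilities({"A": 100, "B": 1})
print(probs["A"], probs["B"])
```

Artist B ends up with 1/101 instead of the desired 1/2, so the per-song shuffle favors artists with large catalogs.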
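This randomly orders each artist's not-yet-listened songs, then presents the final result sorted by score descending and then by each song's row number within its artist, which effectively interleaves random songs from all artists of the same rank: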
SELECT song_id
FROM artists a
NATURAL JOIN songs s
WHERE NOT EXISTS (
SELECT TRUE
FROM listened l
WHERE l.song_id = s.song_id
)
ORDER BY score DESC
, ROW_NUMBER() OVER (PARTITION BY artist_id ORDER BY RANDOM())
, FIRST_VALUE(RANDOM()) OVER (PARTITION BY artist_id)
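To see what the interleaving looks like, here is a rough Python model of the ordering this query produces. It is a sketch under assumptions: the function name `interleaved_order` and the dictionary layout are hypothetical, and the shared per-artist random value stands in for `FIRST_VALUE(RANDOM()) OVER (PARTITION BY artist_id)`.

```python
import random
from collections import defaultdict


def interleaved_order(artists, songs, listened, rng=random):
    """artists: dict artist_id -> score; songs: dict song_id -> artist_id.

    Returns the unlistened song_ids in the order the query would emit them.
    """
    by_artist = defaultdict(list)
    for song_id, artist_id in songs.items():
        if song_id not in listened:          # WHERE NOT EXISTS (...)
            by_artist[artist_id].append(song_id)

    rows = []
    for artist_id, song_ids in by_artist.items():
        rng.shuffle(song_ids)                # ORDER BY RANDOM() in the partition
        tiebreak = rng.random()              # one random value per artist
        for rn, song_id in enumerate(song_ids, start=1):   # ROW_NUMBER()
            rows.append((-artists[artist_id], rn, tiebreak, song_id))

    rows.sort()                              # score DESC, row number, tiebreak
    return [song_id for *_, song_id in rows]
```

Within one score, every artist contributes exactly one row with row number 1, so the first row comes from each tied artist with equal chance regardless of catalog size. Adding LIMIT 1 to the query would return just that row.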
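I would try using a LATERAL JOIN so that the engine visits artists one at a time in score order.
Add artist_id to the listened table to avoid an extra join and to restrict each search to one artist at a time.
Add indexes to the tables. Having these indexes is important.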
artists (score, artist_id)
songs (artist_id, song_id)
listened (artist_id, song_id)
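Written out as DDL, the three indexes could look like the following. The index names are made up for illustration, the column types are guesses, and the statements are executed through Python's `sqlite3` module only so that the snippet runs on its own; the same `CREATE INDEX name ON table (columns)` form is accepted by Postgres.

```python
import sqlite3

DDL = """
CREATE TABLE artists  (artist_id INTEGER PRIMARY KEY, score INTEGER);
CREATE TABLE songs    (song_id INTEGER PRIMARY KEY, artist_id INTEGER);
-- listened carries artist_id as well, as suggested above
CREATE TABLE listened (artist_id INTEGER, song_id INTEGER);

CREATE INDEX ix_artists_score_artist ON artists  (score, artist_id);
CREATE INDEX ix_songs_artist_song    ON songs    (artist_id, song_id);
CREATE INDEX ix_listened_artist_song ON listened (artist_id, song_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# List the explicitly created indexes to confirm the DDL was accepted.
index_names = sorted(
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND name LIKE 'ix%'"
    )
)
print(index_names)
```

The leading column of each index matches the column the query filters or sorts on first: score for walking artists in order, artist_id for finding one artist's songs and listened rows.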
Query
SELECT
artists.artist_id
,s.song_id
FROM
artists
INNER JOIN LATERAL
(
SELECT songs.song_id
FROM songs
WHERE
songs.artist_id = artists.artist_id
AND NOT EXISTS
(
SELECT 1
FROM listened
WHERE
listened.artist_id = songs.artist_id
-- limit listened songs to one artist
AND listened.song_id = songs.song_id
)
ORDER BY random()
-- shuffle only songs of one artist
LIMIT 1
) AS s ON true
ORDER BY artists.score ASC, random()
-- if there are several artists with the same score
-- pick one random artist among them
LIMIT 1;
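The query picks the top artist, shuffles that artist's songs, picks the next top artist, shuffles that artist's songs, and so on.
This query should be fast when the top artists still have songs to play, and it becomes slower the further it has to walk down the list of top artists to lower-ranked rows.
If score is not unique, then ORDER BY score LIMIT 1 returns a "random" row among those with the top score. Which artist is picked is undefined. It is not strictly random, just undefined: it may change or stay the same from one run of the query to the next. To make it truly random, add random() explicitly.
With this addition, the query chooses among several artists with the same score with equal probability, regardless of how many songs each of them has.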
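The equal-probability claim can be checked with a small simulation. The Python below models the result of the LATERAL query, not its execution plan (it visits every artist, whereas the database may stop early). The name `lateral_pick` and the data layout are hypothetical, and listened is assumed to hold (artist_id, song_id) pairs as in the modified table.

```python
import random
from collections import Counter


def lateral_pick(artists, songs_by_artist, listened, rng=random):
    """Return (artist_id, song_id) as the LATERAL query would, or None."""
    rows = []
    for artist_id, score in artists.items():
        fresh = [s for s in songs_by_artist.get(artist_id, [])
                 if (artist_id, s) not in listened]
        if fresh:
            # The lateral subquery yields one random song per artist;
            # artists without a row are dropped by the INNER JOIN.
            rows.append((score, rng.random(), artist_id, rng.choice(fresh)))
    if not rows:
        return None
    # ORDER BY score ASC, random() LIMIT 1
    _, _, artist_id, song_id = min(rows)
    return artist_id, song_id


# Artists A and B tie on the best (lowest) score; A has 100 songs, B has 1.
artists = {"A": 1, "B": 1, "C": 2}
songs_by_artist = {"A": list(range(100)), "B": [1000], "C": [2000]}
rng = random.Random(0)
counts = Counter(
    lateral_pick(artists, songs_by_artist, set(), rng)[0] for _ in range(2000)
)
print(counts)
```

Because each artist contributes a single row before the final sort, A and B should each appear in roughly half of the draws, and C never appears while A or B still has songs.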
You can extend the query so that it considers "batches" of the top N artists, rather than a single top artist at a time:
WITH
CTE
AS
(
SELECT
artists.artist_id
,s.song_id
FROM
artists
INNER JOIN LATERAL
(
SELECT songs.song_id
FROM songs
WHERE
songs.artist_id = artists.artist_id
AND NOT EXISTS
(
SELECT 1
FROM listened
WHERE
listened.artist_id = songs.artist_id
-- limit listened songs to one artist
AND listened.song_id = songs.song_id
)
ORDER BY random()
-- shuffle only songs of one artist
LIMIT 1
) AS s ON true
ORDER BY artists.score ASC
LIMIT 5 -- pick top N artists, N = 5
)
SELECT
artist_id
,song_id
FROM CTE
ORDER BY random() -- shuffle top N artists
LIMIT 1 -- pick one random artist out of top N
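For comparison, here is how the batched variant behaves when modeled in Python. This is a sketch under the same assumptions as the query (ascending score order, listened holding (artist_id, song_id) pairs); the name `batch_pick` and the dictionaries are hypothetical.

```python
import random


def batch_pick(artists, songs_by_artist, listened, n=5, rng=random):
    """Pick one (artist_id, song_id) among the top-n artists that have songs."""
    rows = []
    for artist_id in sorted(artists, key=artists.get):    # ORDER BY score ASC
        fresh = [s for s in songs_by_artist.get(artist_id, [])
                 if (artist_id, s) not in listened]
        if fresh:
            rows.append((artist_id, rng.choice(fresh)))   # one song per artist
            if len(rows) == n:                            # LIMIT n
                break
    # Outer query: ORDER BY random() LIMIT 1 over the batch.
    return rng.choice(rows) if rows else None
```

Note that the batch here is the first N artists that still have unlistened songs, so it can mix artists with different scores, and every member of the batch is equally likely to be chosen. When several artists tie at the N boundary, which of them makes it into the batch is undefined in SQL, whereas this sketch simply keeps dictionary order.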