Counting unread news in a large table
I've got a pretty common (I believe) database structure: there are news items (News(id, source_id)), and each news item has a source (Source(id, url)). Sources are aggregated into topics (Topic(id, title)) via TopicSource(source_id, topic_id). Additionally, there are users (User(id, name)) who can mark news items as read via NewsRead(news_id, user_id). Here is a diagram clarifying the structure:
I want to count the unread news in the topics of a particular user. The problem is that the News table is big (10^6 - 10^7 rows). Luckily, I don't need an exact count: it's fine to stop counting at a threshold and return that threshold as the count.
Following this answer, for a single topic I came up with the following query:
SELECT t.topic_id, count(1) AS unread_count
FROM (
    SELECT 1, topic_id
    FROM news n
    JOIN topic_source t ON n.source_id = t.source_id
    -- LEFT JOIN news_read to filter out already-read news
    LEFT JOIN news_read r
        ON (n.id = r.news_id AND r.user_id = 1)
    WHERE t.topic_id = 3 AND r.user_id IS NULL
    LIMIT 10 -- Threshold
) t
GROUP BY t.topic_id;
(query plan 1). This query takes about 50 ms on the test database, which is acceptable.
Now I want to select unread counts for multiple topics at once. I tried to select like that:
SELECT
    t.topic_id,
    (SELECT count(1)
     FROM (SELECT 1
           FROM news n
           JOIN topic_source tt ON n.source_id = tt.source_id
           LEFT JOIN news_read r
               ON (n.id = r.news_id AND r.user_id = 1)
           WHERE tt.topic_id = t.topic_id AND r.user_id IS NULL
           LIMIT 10 -- Threshold
          ) t) AS unread_count
FROM topic_source t
WHERE t.topic_id IN (1, 2)
GROUP BY t.topic_id;
(query plan 2). But for reasons unknown to me it takes about 1.5 s on the test data, while the sum of the individual queries should be around 0.2-0.3 s.
I'm clearly missing something here. Is there an error in the second query? Is there a better (faster) way to select the unread news counts?
Additional info:
- Here is a fiddle with the DB structure and queries.
- I'm using PostgreSQL 10 and SQLAlchemy (though raw SQL is fine for now).
Table sizes:
News - 10^6 - 10^7
User - 10^3
Source - 10^4
Topic - 10^3
TopicSource - 10^5
NewsRead - 10^6
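For completeness, queries like these lean heavily on a few supporting indexes. A rough sketch of what I would expect to be in place (index names here are illustrative, not necessarily what the fiddle actually defines):
-- assumed/illustrative indexes; the fiddle's actual schema may differ
CREATE INDEX news_source_id_idx ON news (source_id);
CREATE INDEX topic_source_topic_id_idx ON topic_source (topic_id, source_id);
CREATE INDEX news_read_user_id_idx ON news_read (user_id, news_id);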
UPD: The query plan clearly shows that I messed the second query up. Any hint would be appreciated.
UPD2: I tried this query with a lateral join, which is supposed to simply run the first (fastest) query for each topic_id:
SELECT
    t.id,
    -- count(p.topic_id) rather than count(*): for a topic with no unread
    -- news the LEFT JOIN emits one all-NULL row, which count(*) would count as 1
    count(p.topic_id)
FROM topic t
LEFT JOIN LATERAL (
    SELECT ts.topic_id
    FROM news n
    LEFT JOIN news_read r
        ON (n.id = r.news_id AND r.user_id = 1)
    JOIN topic_source ts ON n.source_id = ts.source_id
    WHERE ts.topic_id = t.id AND r.user_id IS NULL
    LIMIT 10
) p ON TRUE
WHERE t.id IN (4, 10, 12, 16)
GROUP BY t.id;
(query plan 3). But the Pg planner seems to have a different opinion about it: it runs very slow sequential scans and hash joins instead of index scans and merge joins.
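For anyone reproducing this: the linked plans come from EXPLAIN. As a minimal sketch (standard PostgreSQL syntax), prefixing the per-topic building block with EXPLAIN (ANALYZE, BUFFERS) makes it easy to compare its plan against the lateral version:
-- inspect the plan of a single per-topic probe in isolation
EXPLAIN (ANALYZE, BUFFERS)
SELECT 1
FROM news n
JOIN topic_source ts ON n.source_id = ts.source_id
LEFT JOIN news_read r
    ON (n.id = r.news_id AND r.user_id = 1)
WHERE ts.topic_id = 4 AND r.user_id IS NULL
LIMIT 10;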
After some benchmarking I finally settled on a plain UNION ALL query, which turned out to be ten times faster than the lateral join on my data:
SELECT
    p.topic_id,
    count(*)
FROM (
    SELECT *
    FROM (
        SELECT fs.topic_id
        FROM news n
        LEFT JOIN news_read r
            ON (n.id = r.news_id AND r.user_id = 1)
        JOIN topic_source fs ON n.source_id = fs.source_id
        WHERE fs.topic_id = 4 AND r.user_id IS NULL
        LIMIT 100
    ) t1
    UNION ALL
    SELECT *
    FROM (
        SELECT fs.topic_id
        FROM news n
        LEFT JOIN news_read r
            ON (n.id = r.news_id AND r.user_id = 1)
        JOIN topic_source fs ON n.source_id = fs.source_id
        WHERE fs.topic_id = 10 AND r.user_id IS NULL
        LIMIT 100
    ) t1
    UNION ALL
    SELECT *
    FROM (
        SELECT fs.topic_id
        FROM news n
        LEFT JOIN news_read r
            ON (n.id = r.news_id AND r.user_id = 1)
        JOIN topic_source fs ON n.source_id = fs.source_id
        WHERE fs.topic_id = 12 AND r.user_id IS NULL
        LIMIT 100
    ) t1
    UNION ALL
    SELECT *
    FROM (
        SELECT fs.topic_id
        FROM news n
        LEFT JOIN news_read r
            ON (n.id = r.news_id AND r.user_id = 1)
        JOIN topic_source fs ON n.source_id = fs.source_id
        WHERE fs.topic_id = 16 AND r.user_id IS NULL
        LIMIT 100
    ) t1
) p
GROUP BY p.topic_id;
The intuition here is that explicitly specifying the topic_id values gives the Pg planner enough information to build an efficient plan.
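One caveat: a topic with zero unread news contributes no rows to the UNION ALL at all, so it is simply absent from the result rather than reported as 0. If 0-count rows are needed, the aggregate can be wrapped with a LEFT JOIN back to topic; a sketch (shortened to a single branch for brevity):
SELECT t.id AS topic_id, coalesce(c.unread_count, 0) AS unread_count
FROM topic t
LEFT JOIN (
    -- the UNION ALL aggregate from above, one branch shown here
    SELECT p.topic_id, count(*) AS unread_count
    FROM (
        SELECT fs.topic_id
        FROM news n
        LEFT JOIN news_read r
            ON (n.id = r.news_id AND r.user_id = 1)
        JOIN topic_source fs ON n.source_id = fs.source_id
        WHERE fs.topic_id = 4 AND r.user_id IS NULL
        LIMIT 100
    ) p
    GROUP BY p.topic_id
) c ON c.topic_id = t.id
WHERE t.id IN (4, 10, 12, 16);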
From the SQLAlchemy point of view it is pretty straightforward:
# topic_ids, user_id are defined elsewhere, e.g.
# topic_ids = [4, 10, 12, 16]
# user_id = 1
# (assumes `db` and the News/TopicSource/NewsRead models are defined,
#  and `from sqlalchemy import and_, func` has been done)
topic_queries = []
for topic_id in topic_ids:
    topic_query = (
        db.session.query(News.id, TopicSource.topic_id)
        .join(TopicSource, TopicSource.source_id == News.source_id)
        # LEFT JOIN the NewsRead table to keep only unread news
        # (rows where NewsRead.user_id IS NULL after the outer join)
        .outerjoin(NewsRead,
                   and_(NewsRead.news_id == News.id,
                        NewsRead.user_id == user_id))
        .filter(TopicSource.topic_id == topic_id,
                NewsRead.user_id.is_(None))
        .limit(100))
    topic_queries.append(topic_query)
# Unite the per-topic queries with UNION ALL
union_query = topic_queries[0].union_all(*topic_queries[1:])
# Group the united query by `topic_id` and count unreads
counts = (union_query
          # Using `with_entities(func.count())` to avoid
          # a subquery. See link below for info:
          # https://gist.github.com/hest/8798884
          .with_entities(TopicSource.topic_id.label('topic_id'),
                         func.count().label('unread_count'))
          .group_by(TopicSource.topic_id))
result = counts.all()
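The same caveat as in the raw SQL applies here: counts.all() only returns (topic_id, unread_count) pairs for topics with at least one unread item, so any topic_id missing from the result should be treated as 0 on the application side.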