BigQuery - select 来自大型 table 每个 subreddit 的前 N 个帖子
BigQuery - select top N posts from a large table for each subreddit
我正在 Google BigQuery 上对 Reddit 数据进行数据挖掘,我想根据整个 201704 数据的每个 subreddit 的分数排名前 1000 篇帖子。我尝试了不同的技术,但由于 BigQuery 的限制,结果太大 return。
select body, score, subreddit from
(
select body, score, subreddit,row_number() over
(
partition by subreddit order by score desc
) mm
from [fh-bigquery:reddit_comments.2017_04]
)
where mm <= 1000 AND subreddit in
(
select subreddit from
(
select Count(subreddit) as counts, subreddit from
[fh-bigquery:reddit_comments.2017_04] GROUP BY subreddit ORDER BY counts DESC
LIMIT 10000
)
)
LIMIT 10000000
有没有办法分而治之这个问题,因为启用大查询结果意味着不能做任何复杂的查询。 Google 是否为大型查询资源提供支付选项?
I wanna top 1000 posts ranked by the score for each subreddit for the whole 201704 data
我刚刚测试了这个查询:
SELECT
subreddit,
ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) data
FROM `fh-bigquery.reddit_comments.2017_04`
GROUP BY 1
它在 22 秒内处理了整个数据集:
在您的查询中,您似乎想要前 10000 个最受欢迎的 subreddits 的帖子和分数。我试过这个查询:
SELECT
subreddit,
ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) data
FROM `fh-bigquery.reddit_comments.2017_04`
WHERE subreddit IN(
SELECT subreddit FROM(
SELECT
subreddit
FROM `fh-bigquery.reddit_comments.2017_04`
GROUP BY subreddit
ORDER BY count(body) DESC
LIMIT 10000)
)
GROUP BY 1
并在 26 秒内得到结果:
希望这些结果就是您要找的。让我知道是否一切正确。
我正在 Google BigQuery 上对 Reddit 数据进行数据挖掘,我想根据整个 201704 数据的每个 subreddit 的分数排名前 1000 篇帖子。我尝试了不同的技术,但由于 BigQuery 的限制,结果太大 return。
select body, score, subreddit from
(
select body, score, subreddit,row_number() over
(
partition by subreddit order by score desc
) mm
from [fh-bigquery:reddit_comments.2017_04]
)
where mm <= 1000 AND subreddit in
(
select subreddit from
(
select Count(subreddit) as counts, subreddit from
[fh-bigquery:reddit_comments.2017_04] GROUP BY subreddit ORDER BY counts DESC
LIMIT 10000
)
)
LIMIT 10000000
有没有办法分而治之这个问题,因为启用大查询结果意味着不能做任何复杂的查询。 Google 是否为大型查询资源提供支付选项?
I wanna top 1000 posts ranked by the score for each subreddit for the whole 201704 data
我刚刚测试了这个查询:
SELECT
subreddit,
ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) data
FROM `fh-bigquery.reddit_comments.2017_04`
GROUP BY 1
它在 22 秒内处理了整个数据集:
在您的查询中,您似乎想要前 10000 个最受欢迎的 subreddits 的帖子和分数。我试过这个查询:
SELECT
subreddit,
ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) data
FROM `fh-bigquery.reddit_comments.2017_04`
WHERE subreddit IN(
SELECT subreddit FROM(
SELECT
subreddit
FROM `fh-bigquery.reddit_comments.2017_04`
GROUP BY subreddit
ORDER BY count(body) DESC
LIMIT 10000)
)
GROUP BY 1
并在 26 秒内得到结果:
希望这些结果就是您要找的。让我知道是否一切正确。