Need help querying top reddit comments by top posts

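I'm trying to collect the top-voted comments (say, the top 20) from the top-voted posts in a given subreddit.

Any help would be appreciated!

I found this code, which I've been running in BigQuery, but I can't seem to get both the post score and the comment score without running into duplication problems.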

SELECT posts.title, posts.score, comments.body, posts.subreddit
FROM `fh-bigquery.reddit_comments.2018_10` AS comments
JOIN `fh-bigquery.reddit_posts.2018_10`  AS posts
ON posts.id = SUBSTR(comments.link_id, 4) 
WHERE posts.subreddit = 'Showerthoughts'

As a simplified example, I'd like to be able to see:

Post Title 1 | Post Score | (Within Post Title 1) Comment Body 1 | Comment Score

Post Title 1 | Post Score | (Within Post Title 1) Comment Body 2 | Comment Score

Post Title 2 | Post Score | (Within Post Title 2) Comment Body 1 | Comment Score

Post Title 2 | Post Score | (Within Post Title 2) Comment Body 2 | Comment Score

Here's a quick way to get around the duplicate text blob problem:

SELECT title, score, body, subreddit FROM (
    SELECT
        -- Hash the title so that grouping uses a short key instead of the full text.
        TO_HEX(MD5(posts.title)) AS title_hash,
        -- Take one value per group; every row in a group shares the same title hash.
        ARRAY_AGG(posts.title)[OFFSET(0)] AS title,
        ARRAY_AGG(comments.body)[OFFSET(0)] AS body,
        ARRAY_AGG(posts.score)[OFFSET(0)] AS score,
        ARRAY_AGG(posts.subreddit)[OFFSET(0)] AS subreddit
    FROM `fh-bigquery.reddit_comments.2018_10` AS comments
    JOIN `fh-bigquery.reddit_posts.2018_10` AS posts
    ON posts.id = SUBSTR(comments.link_id, 4)
    WHERE posts.subreddit = 'Showerthoughts'
    GROUP BY 1
    ORDER BY 1
)

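The idea is to convert the expensive text blobs into MD5 hashes, then go about your usual business with the unique entries. From those distinct values you can sort the content however you like.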
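
If what you ultimately want is the layout from the question (each top post repeated once per top comment, with both scores), one possible approach is to pick the top posts first and then rank the comments inside each post with a window function. This is only a sketch, not something I've run against these tables: it assumes the comments table has a `score` column and that `link_id` always has the form `t3_<post id>`. The CTE names (`top_posts`, `ranked_comments`) and the `comment_rank` alias are made up for illustration.

```sql
-- Sketch: top 20 posts by score, then the top 20 comments within each post.
-- Assumption: comments.score exists and link_id looks like 't3_<post id>'.
WITH top_posts AS (
  SELECT id, title, score
  FROM `fh-bigquery.reddit_posts.2018_10`
  WHERE subreddit = 'Showerthoughts'
  ORDER BY score DESC
  LIMIT 20
),
ranked_comments AS (
  SELECT
    p.id    AS post_id,
    p.title AS post_title,
    p.score AS post_score,
    c.body  AS comment_body,
    c.score AS comment_score,
    -- Number the comments of each post from highest to lowest score.
    ROW_NUMBER() OVER (PARTITION BY p.id ORDER BY c.score DESC) AS comment_rank
  FROM top_posts AS p
  JOIN `fh-bigquery.reddit_comments.2018_10` AS c
    ON p.id = SUBSTR(c.link_id, 4)
)
SELECT post_title, post_score, comment_body, comment_score
FROM ranked_comments
WHERE comment_rank <= 20
ORDER BY post_score DESC, post_id, comment_rank
```

Partitioning on the post `id` rather than the title keeps two posts with identical titles apart. Change the `LIMIT` and the `comment_rank` cutoff to get a different number of posts or comments.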