BigQuery: DISTINCT ON and GROUP BY
Following on from "Select first row in each GROUP BY group?", I am trying to do something very similar in Google BigQuery.
Dataset: fh-bigquery:reddit_comments.2018_01
Aim: for each link_id (Reddit submission), select the first comment in terms of created_utc.
SELECT body,link_id
FROM [fh-bigquery:reddit_comments.2018_01]
where subreddit_id == "t5_2zkvo"
group by link_id ,body, created_utc
order by link_id ,body, created_utc desc
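At the moment this does not work, because it still does not give me unique/distinct parent_id(s).
Thanks in advance!
Edit:
I was wrong to say that parent_id == submission; the submission is actually identified by link_id.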
We can use ROW_NUMBER() here:
SELECT body, parent_id, created_utc
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY parent_id ORDER BY created_utc) rn
FROM [fh-bigquery:reddit_comments.2018_01]
WHERE subreddit_id = 't5_2zkvo'
) t
WHERE rn = 1
ORDER BY parent_id ,body, created_utc DESC;
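To see what the rn = 1 filter does on a small scale, here is a minimal sketch that runs the same ROW_NUMBER() pattern from Python against an in-memory SQLite table. The table name `comments`, the function name, and the rows are made up for illustration, and SQLite (3.25 or later, which is needed for window functions) only stands in for BigQuery, so just the query shape carries over. It partitions by link_id rather than parent_id, following the edit in the question.

```python
import sqlite3

# Hypothetical toy rows: (link_id, body, created_utc). Not real Reddit data.
ROWS = [
    ("t3_a", "second on a", 20),
    ("t3_a", "first on a", 10),
    ("t3_b", "only on b", 15),
]


def first_comment_per_link(rows):
    """Return the earliest (link_id, body, created_utc) row for each link_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments (link_id TEXT, body TEXT, created_utc INTEGER)"
    )
    conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
    query = """
        SELECT link_id, body, created_utc
        FROM (
            -- Number the comments of each submission from oldest to newest.
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY link_id ORDER BY created_utc
                   ) AS rn
            FROM comments
        ) t
        WHERE rn = 1          -- keep only the oldest comment per submission
        ORDER BY link_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in first_comment_per_link(ROWS):
        print(row)
```

Each link_id appears exactly once in the result, which is the "distinct" behaviour the question asks for.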
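Note that you could continue with your current approach, but then you would have to phrase the query as a join between the table and a subquery that finds the earliest entry for each parent_id: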
SELECT t1.*
FROM [fh-bigquery:reddit_comments.2018_01] t1
INNER JOIN
(
SELECT parent_id, MIN(created_utc) AS first_created_utc
FROM [fh-bigquery:reddit_comments.2018_01]
GROUP BY parent_id
) t2
ON t1.parent_id = t2.parent_id AND t1.created_utc = t2.first_created_utc;
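One property of the join form is worth knowing: if two comments in a group share the minimum created_utc, the join returns both of them, whereas a ROW_NUMBER() filter keeps exactly one. The sketch below shows this on toy data using Python and an in-memory SQLite table. The table name `comments`, the function name, and the rows are hypothetical, SQLite only stands in for BigQuery, and the grouping key is link_id as per the edit in the question.

```python
import sqlite3

# Hypothetical toy rows: (link_id, body, created_utc).
# Two comments on t3_a deliberately tie at created_utc = 10.
ROWS = [
    ("t3_a", "tied x", 10),
    ("t3_a", "tied y", 10),
    ("t3_a", "later", 30),
    ("t3_b", "only on b", 15),
]


def earliest_via_join(rows):
    """Return every row whose created_utc equals the minimum for its link_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments (link_id TEXT, body TEXT, created_utc INTEGER)"
    )
    conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
    query = """
        SELECT t1.link_id, t1.body, t1.created_utc
        FROM comments t1
        INNER JOIN (
            -- Earliest timestamp per submission.
            SELECT link_id, MIN(created_utc) AS first_created_utc
            FROM comments
            GROUP BY link_id
        ) t2
        ON t1.link_id = t2.link_id
           AND t1.created_utc = t2.first_created_utc
        ORDER BY t1.link_id, t1.body
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in earliest_via_join(ROWS):
        print(row)
```

Here t3_a comes back twice because of the tie, so the join form only guarantees one row per group when the timestamps within a group are unique.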
The following is for BigQuery Standard SQL:
#standardSQL
SELECT
ARRAY_AGG(body ORDER BY created_utc LIMIT 1)[OFFSET(0)] body,
link_id
FROM `fh-bigquery.reddit_comments.2018_01`
WHERE subreddit_id = 't5_2zkvo'
GROUP BY link_id
-- ORDER BY link_id