How to do subqueries in BigQuery?
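I'm trying to work with the reddit data on BigQuery, and I'd like to see a comment and its reply in a single row. I see that BigQuery supports subqueries, but I can't get the query built. Because of how the data is structured, I have to self-join the table using a subquery; specifically, I want to join id to parent_id, but I need to modify id before joining. Here is how I tried to write the query: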
SELECT
p.subreddit,
p.body AS first_body,
p.score AS first_score,
CONCAT('t1_',p.id) AS first_id ,
c.last_body,
c.last_score,
c.last_id
FROM
[fh-bigquery:reddit_comments.2016_01] p,
(
SELECT
body AS last_body,
score AS last_score,
CONCAT('t1_',id) AS last_id,
parent_id,
author,
body
FROM [fh-bigquery:reddit_comments.2016_01]
WHERE body != '[deleted]'
AND author != '[deleted]'
AND score > 1
) c
WHERE p.first_id = c.parent_id
AND p.score > 1
AND p.author != '[deleted]'
AND p.body != '[deleted]';
The error I get is:
Field 'c.parent_id' not found in table 'fh-bigquery:reddit_comments.2016_01'; did you mean 'parent_id'?
You can run the query here:
https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2016_01
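I'm not sure how to fix this. What is the right way to write this join and get the query to run?
You probably want something like the following (just a guess):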
SELECT
p.subreddit,
p.body AS first_body,
p.score AS first_score,
CONCAT('t1_',p.id) AS first_id ,
c.last_body,
c.last_score,
c.last_id
FROM
[fh-bigquery:reddit_comments.2016_01] p
JOIN (
SELECT
body AS last_body,
score AS last_score,
CONCAT('t1_',id) AS last_id,
parent_id,
author,
body
FROM [fh-bigquery:reddit_comments.2016_01]
WHERE body != '[deleted]'
AND author != '[deleted]'
AND score > 1
) c
ON p.link_id = c.parent_id
WHERE p.score > 1
AND p.author != '[deleted]'
AND p.body != '[deleted]'
LIMIT 100
See more about JOIN.
Note that I only converted your query to use JOIN properly; the query logic itself still needs to be refined to fit your needs.
Added to address additional info in your comment:
SELECT
subreddit,
first_body,
first_score,
first_id ,
last_body,
last_score,
last_id
FROM (
SELECT
subreddit,
body AS first_body,
score AS first_score,
CONCAT('t1_',id) AS first_id
FROM [fh-bigquery:reddit_comments.2016_01]
WHERE score > 1
AND author != '[deleted]'
AND body != '[deleted]'
) p
JOIN (
SELECT
body AS last_body,
score AS last_score,
CONCAT('t1_',id) AS last_id,
parent_id,
author,
body
FROM [fh-bigquery:reddit_comments.2016_01]
WHERE body != '[deleted]'
AND author != '[deleted]'
AND score > 1
) c
ON p.first_id = c.parent_id
LIMIT 100
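In BigQuery's legacy SQL dialect (the one that uses the [project:dataset.table] syntax in this query), a comma between tables means UNION ALL, not JOIN. You need to write the join explicitly with the JOIN keyword.
I would also suggest pushing both sides of the join into subqueries, so that all filters are applied before the join is performed. (The join is by far the most expensive part of the query, so applying the filters first helps the query run as fast as possible.)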
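To check the join logic without scanning the real table, the same shape of query can be tried on a toy table. The following is only a sketch using Python's built-in sqlite3 module rather than BigQuery: the table name `comments` and all four rows are invented for illustration, and SQLite's `||` operator stands in for CONCAT. Both sides of the self-join are filtered inside their own subquery, and the prefixed id is computed before the join.

```python
import sqlite3

# Toy stand-in for the reddit comments table; every row here is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id TEXT, parent_id TEXT, subreddit TEXT, "
    "author TEXT, body TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a1", "t3_post", "python", "alice", "What is a JOIN?", 5),
        ("b2", "t1_a1", "python", "bob", "It combines rows.", 3),
        ("c3", "t1_a1", "python", "[deleted]", "[deleted]", 9),
        ("d4", "t1_b2", "python", "carol", "Thanks!", 1),
    ],
)

# Each side is filtered in its own subquery, and 't1_' is prepended to id
# there, so the ON clause only compares columns that already exist.
QUERY = """
SELECT p.subreddit, p.first_body, p.first_score, p.first_id,
       c.last_body, c.last_score, c.last_id
FROM (
  SELECT subreddit, body AS first_body, score AS first_score,
         't1_' || id AS first_id
  FROM comments
  WHERE score > 1 AND author != '[deleted]' AND body != '[deleted]'
) AS p
JOIN (
  SELECT body AS last_body, score AS last_score,
         't1_' || id AS last_id, parent_id
  FROM comments
  WHERE score > 1 AND author != '[deleted]' AND body != '[deleted]'
) AS c
ON p.first_id = c.parent_id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

With this toy data only the pair (a1, b2) survives: the deleted reply and the reply with score 1 are dropped by the filters before the join ever sees them.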