How can I get the benefits of a BigQuery partitioned table when the values used to filter the partitioning column come from a subquery?
I have a query like this:
WITH data AS (
SELECT *
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
WHERE wiki='en'
AND title LIKE 'Goo%'
), min_date_calc AS (
SELECT DATE(MIN(datehour)) FROM data WHERE datehour<'2018-01-05'
)
SELECT title, SUM(views) views
FROM data
WHERE
DATE(datehour) BETWEEN (SELECT * FROM min_date_calc) AND DATE_ADD((SELECT * FROM min_date_calc), INTERVAL 7 DAY)
GROUP BY 1 ORDER BY 2 DESC LIMIT 1
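This query either doesn't run, or doesn't use the partitions to prune the queried data - it scans the whole table instead. What can I do here?
(Question based on a comment on reddit)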
This query will scan only 2.86 GB:
WITH data AS (
SELECT *
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
WHERE wiki='en'
AND title LIKE 'Goo%'
)
SELECT title, SUM(views) views
FROM data
WHERE
DATE(datehour) BETWEEN '2018-01-01' AND '2018-01-08' # 2.68 GB
GROUP BY 1 ORDER BY 2 DESC LIMIT 1
But that's because I wrote the dates as constants. Similarly, a query like this one will also perform well:
WITH data AS (
SELECT *
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
WHERE wiki='en'
AND title LIKE 'Goo%'
), min_date AS (
SELECT DATE('2018-01-01')
)
SELECT title, SUM(views) views
FROM data
WHERE
DATE(datehour) BETWEEN (SELECT * FROM min_date) AND DATE_ADD((SELECT * FROM min_date), INTERVAL 7 DAY)
GROUP BY 1 ORDER BY 2 DESC LIMIT 1
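This works because the date is still a constant (even when it is wrapped in a function).
To get the same benefits with values that are not constants, we can use scripting: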
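To illustrate what "a constant inside a function" can look like, here is a minimal sketch in which the upper bound is computed by DATE_ADD over a date literal. The specific dates are illustrative only, and I have not measured the bytes scanned for this exact form; the assumption is that it behaves like the constant-date queries above because neither bound depends on table data.

```sql
-- Sketch: both bounds are constant expressions (no table data involved).
-- The dates are illustrative only.
SELECT title, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
WHERE wiki='en'
AND title LIKE 'Goo%'
AND DATE(datehour) BETWEEN DATE '2018-01-01' AND DATE_ADD(DATE '2018-01-01', INTERVAL 7 DAY)
GROUP BY 1 ORDER BY 2 DESC LIMIT 1
```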
DECLARE min_date DATE;
SET min_date = (
WITH data AS (
SELECT *
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
WHERE wiki='en'
AND title LIKE 'Goo%'
)
SELECT DATE(MIN(datehour)) FROM data WHERE datehour<'2018-01-04'
);
WITH data AS (
SELECT *
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
WHERE wiki='en'
AND title LIKE 'Goo%'
)
SELECT title, SUM(views) views
FROM data
WHERE
DATE(datehour) BETWEEN min_date AND DATE_ADD(min_date, INTERVAL 7 DAY) # 2.68 GB
GROUP BY 1 ORDER BY 2 DESC LIMIT 1
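Now we see two queries:
- The first one goes through all the data needed to set the value of min_date.
- The second one uses 2.86 GB, the same as the earlier query with constant dates, but now the value of min_date comes from a dynamic query.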
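The script can also be written a little more compactly. The following is a minimal sketch, assuming the DECLARE ... DEFAULT form of BigQuery scripting; max_date is a name introduced here for illustration and is not part of the original script.

```sql
-- Sketch only: same idea as the script above, with the lookup folded into DECLARE.
DECLARE min_date DATE DEFAULT (
  SELECT DATE(MIN(datehour))
  FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
  WHERE wiki='en'
  AND title LIKE 'Goo%'
  AND datehour<'2018-01-04'
);
-- max_date is a hypothetical helper variable, computed once from min_date.
DECLARE max_date DATE DEFAULT DATE_ADD(min_date, INTERVAL 7 DAY);

-- By the time this statement runs, both bounds are script variables with fixed values.
SELECT title, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2018`
WHERE wiki='en'
AND title LIKE 'Goo%'
AND DATE(datehour) BETWEEN min_date AND max_date
GROUP BY 1 ORDER BY 2 DESC LIMIT 1
```

The first statement still has to scan whatever it needs to compute min_date. I would expect the final SELECT to be pruned the same way as in the script above, but it is worth confirming the bytes processed for each statement in the job details.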