Partition by week/month/quarter/year to get over the partition limit?
I have 32 years of data that I want to put into a partitioned table. But BigQuery says that I'm going over the limit (4000 partitions).
For a query like:
CREATE TABLE `deleting.day_partition`
PARTITION BY FlightDate
AS
SELECT *
FROM `flights.original`
I get an error like:
Too many partitions produced by query, allowed 2000, query produces at least 11384 partitions
How can I get over this limit?
You can partition by week/month/year instead of partitioning by day.
In my case each year of data contains around 3GB, so I'll get the most benefit out of clustering if I partition by year.
For this I'll create a year date column, and partition by it:
CREATE TABLE `fh-bigquery.flights.ontime_201903`
PARTITION BY FlightDate_year
CLUSTER BY Origin, Dest
AS
SELECT *, DATE_TRUNC(FlightDate, YEAR) FlightDate_year
FROM `fh-bigquery.flights.raw_load_fixed`
Note that I created the extra column DATE_TRUNC(FlightDate, YEAR) AS FlightDate_year in the process.
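The same approach works at finer granularities when yearly partitions are too coarse. A sketch, assuming the same source table; the target table name is made up, and 32 years of data gives roughly 32 × 12 = 384 monthly partitions, well under the limit:

```sql
-- Hypothetical month-level variant of the statement above.
-- DATE_TRUNC(..., MONTH) returns the first day of each month,
-- so every row in a month shares one partition value.
CREATE TABLE `deleting.month_partition`
PARTITION BY FlightDate_month
CLUSTER BY Origin, Dest
AS
SELECT *, DATE_TRUNC(FlightDate, MONTH) AS FlightDate_month
FROM `flights.original`
```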
Table data:
Since the table is clustered, I'll get the benefits of partitioning even if I don't use the partitioning column (year) as a filter:
SELECT *
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate BETWEEN '2008-01-01' AND '2008-01-10'
Predicted cost: 83.4 GB
Actual cost: 3.2 GB
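You can also filter on the partition column explicitly to guarantee partition pruning on top of cluster pruning. A sketch of the same query with that filter added; since DATE_TRUNC(FlightDate, YEAR) stores the first day of the year, 2008 rows all carry the value 2008-01-01:

```sql
-- Adding the partition column to the WHERE clause makes the
-- partition pruning explicit rather than relying on clustering alone.
SELECT *
FROM `fh-bigquery.flights.ontime_201903`
WHERE FlightDate_year = DATE '2008-01-01'
AND FlightDate BETWEEN '2008-01-01' AND '2008-01-10'
```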
As another example, I created a NOAA GSOD summary table clustered by station name - and instead of partitioning by day, I didn't partition it at all.
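The source doesn't show how that summary table was built; a hedged sketch of one way to do it, joining the raw GSOD tables to the station list and clustering by name only (the target table name, and the exact column list, are assumptions):

```sql
-- Sketch: an unpartitioned table clustered by station name.
-- PARSE_DATE rebuilds a DATE from the year/mo/da string columns.
CREATE TABLE `project.dataset.gsod_all`
CLUSTER BY name
AS
SELECT b.name, b.state, a.temp,
  PARSE_DATE('%Y%m%d', CONCAT(a.year, a.mo, a.da)) AS date
FROM `bigquery-public-data.noaa_gsod.gsod*` a
JOIN `bigquery-public-data.noaa_gsod.stations` b
ON a.wban=b.wban AND a.stn=b.usaf
```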
Say I want to find the hottest days since 1980 for all stations with a name like SAN FRAN%:
SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
FROM `fh-bigquery.weather_gsod.all`
WHERE name LIKE 'SAN FRANC%'
AND date > '1980-01-01'
GROUP BY 1,2
ORDER BY active_until DESC
Note that I got the results after processing only 55.2MB of data.
The equivalent query on the source tables (with no clustering) processes 4GB instead:
# query on non-clustered tables - too much data compared to the other one
SELECT name, state, ARRAY_AGG(STRUCT(CONCAT(a.year,a.mo,a.da),temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(CONCAT(a.year,a.mo,a.da)) active_until
FROM `bigquery-public-data.noaa_gsod.gsod*` a
JOIN `bigquery-public-data.noaa_gsod.stations` b
ON a.wban=b.wban AND a.stn=b.usaf
WHERE name LIKE 'SAN FRANC%'
AND _table_suffix >= '1980'
GROUP BY 1,2
ORDER BY active_until DESC
I also added a geo clustered table, to search by location instead of station name. See the details here: