BigQuery/SQL: Sum over intervals indicated by a secondary table
Suppose I have two tables: intervals contains index intervals (its columns are i_min and i_max), and values contains indexed values (columns i and x). Here is an example:
values:       intervals:
+---+---+     +-------+-------+
| i | x |     | i_min | i_max |
+---+---+     +-------+-------+
| 1 | 1 |     |     1 |     4 |
| 2 | 0 |     |     6 |     6 |
| 3 | 4 |     |     6 |     6 |
| 4 | 9 |     |     6 |     6 |
| 6 | 7 |     |     7 |     9 |
| 7 | 2 |     |    12 |    17 |
| 8 | 2 |     +-------+-------+
| 9 | 2 |
+---+---+
I want to sum the values of x over each interval:
result:
+-------+-------+-----+
| i_min | i_max | sum |
+-------+-------+-----+
|     1 |     4 |  14 |  // 1+0+4+9
|     6 |     6 |   7 |
|     6 |     6 |   7 |
|     6 |     6 |   7 |
|     7 |     9 |   6 |  // 2+2+2
|    12 |    17 |   0 |
+-------+-------+-----+
In some SQL engines, this could be done with:
SELECT
i_min,
i_max,
(SELECT SUM(x)
FROM values
WHERE i BETWEEN intervals.i_min AND intervals.i_max) AS sum_x
FROM
intervals
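However, BigQuery does not allow this type of query ("Subselect not allowed in SELECT clause." or "LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.", depending on the syntax used).
There must be a way to do this with window functions, but I can't figure out how; all the examples I've seen use a column of the table itself as the partition. Is there an option that doesn't use CROSS JOIN? If not, what is the most efficient way to do this CROSS JOIN?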
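Some notes about my data:
- Both tables contain many (10⁸-10⁹) rows.
- There may be duplicates in intervals, but not in i.
- However, two intervals in intervals are either identical or entirely disjoint (no overlaps).
- The union of all intervals is typically close to the set of all values of i (so it forms a partition of this space).
- Intervals can be large (e.g., i_max-i_min < 10⁶).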
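Try the following (BigQuery Standard SQL). ROW_NUMBER() tags each row of intervals so that duplicate intervals stay separate in the GROUP BY, and the extra (NULL, 0) row joins to every interval, so intervals that contain no values still come out with a sum of 0: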
#standardSQL
SELECT
i_min, i_max, SUM(x) AS sum_x
FROM (
SELECT i_min, i_max, ROW_NUMBER() OVER() AS line FROM `project.dataset.intervals`
) AS intervals
JOIN (SELECT i, x FROM `project.dataset.values` UNION ALL SELECT NULL, 0) AS values
ON values.i BETWEEN intervals.i_min AND intervals.i_max OR values.i IS NULL
GROUP BY i_min, i_max, line
-- ORDER BY i_min
You can test and play with the above using dummy data, as below:
#standardSQL
WITH intervals AS (
SELECT 1 AS i_min, 4 AS i_max UNION ALL
SELECT 6, 6 UNION ALL
SELECT 6, 6 UNION ALL
SELECT 6, 6 UNION ALL
SELECT 7, 9 UNION ALL
SELECT 12, 17
),
values AS (
SELECT 1 AS i, 1 AS x UNION ALL
SELECT 2, 0 UNION ALL
SELECT 3, 4 UNION ALL
SELECT 4, 9 UNION ALL
SELECT 6, 7 UNION ALL
SELECT 7, 2 UNION ALL
SELECT 8, 2 UNION ALL
SELECT 9, 2
)
SELECT
i_min, i_max, SUM(x) AS sum_x
FROM (SELECT i_min, i_max, ROW_NUMBER() OVER() AS line FROM intervals) AS intervals
JOIN (SELECT i, x FROM values UNION ALL SELECT NULL, 0) AS values
ON values.i BETWEEN intervals.i_min AND intervals.i_max OR values.i IS NULL
GROUP BY i_min, i_max, line
-- ORDER BY i_min
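On the efficiency part of the question: the join above has no equality condition, so the engine may have to treat it as a cross join with a filter. One possible way around that, not part of the answer above and untested at the 10⁸-10⁹ row scale, is to bucket both tables so that the join gets an equality key. The sketch below assumes a bucket width of 1000000, chosen only because the question says i_max-i_min < 10⁶; the CTE names (distinct_intervals, interval_buckets, sums) are made up for illustration.

```sql
#standardSQL
-- Sketch only: the bucket width 1000000 is an assumption taken from the
-- stated bound i_max - i_min < 10^6; it affects cost, not correctness.
WITH distinct_intervals AS (
  -- Duplicate intervals are summed once and re-expanded in the final join.
  SELECT DISTINCT i_min, i_max
  FROM `project.dataset.intervals`
),
interval_buckets AS (
  -- One row per (interval, bucket) pair. With the assumed width, an
  -- interval touches at most two buckets.
  SELECT i_min, i_max, bucket
  FROM distinct_intervals,
    UNNEST(GENERATE_ARRAY(DIV(i_min, 1000000), DIV(i_max, 1000000))) AS bucket
),
sums AS (
  -- Equality on the bucket gives the join a key; BETWEEN then filters
  -- within the bucket.
  SELECT b.i_min, b.i_max, SUM(v.x) AS sum_x
  FROM interval_buckets AS b
  JOIN `project.dataset.values` AS v
    ON DIV(v.i, 1000000) = b.bucket
   AND v.i BETWEEN b.i_min AND b.i_max
  GROUP BY b.i_min, b.i_max
)
-- Equality-only LEFT JOIN: every row of intervals (duplicates included)
-- is kept, and intervals without any values get 0.
SELECT t.i_min, t.i_max, IFNULL(s.sum_x, 0) AS sum_x
FROM `project.dataset.intervals` AS t
LEFT JOIN sums AS s
  ON t.i_min = s.i_min AND t.i_max = s.i_max
-- ORDER BY i_min
```

Because the final join uses only equalities, it needs neither the ROW_NUMBER() column nor the extra (NULL, 0) row. Whether this actually runs faster than the range join depends on the data distribution and on how the query gets planned, so it is worth comparing both on a sample first.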