BigQuery: Union two different tables that are based on federated Google Spreadsheets

I have two different Google Spreadsheets:

One with 4 columns:

+------+------+------+------+
| Col1 | Col2 | Col5 | Col6 |
+------+------+------+------+
| ID1  | A    | B    | C    |
| ID2  | D    | E    | F    |
+------+------+------+------+

And one with the 4 columns of the previous file, plus 2 more columns:

+------+------+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 | Col5 | Col6 |
+------+------+------+------+------+------+
| ID3  | G    | H    | J    | K    | L    |
| ID4  | M    | N    | O    | P    | Q    |
+------+------+------+------+------+------+

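I configured them as federated sources in Google BigQuery, and now I need to create a view that combines the data of both tables.

Both tables have a Col1 column containing an ID. This ID is unique across all tables, with no duplicates.

The result table I'm looking for is the following: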

+------+------+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 | Col5 | Col6 |
+------+------+------+------+------+------+
| ID1  | A    | NULL | NULL | B    | C    |
| ID2  | D    | NULL | NULL | E    | F    |
| ID3  | G    | H    | J    | K    | L    |
| ID4  | M    | N    | O    | P    | Q    |
+------+------+------+------+------+------+

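For the columns that the first file doesn't have, I expect a NULL value.

I'm using standard SQL. Here is a statement that can be used to generate sample data: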

#standardSQL

WITH table1 AS (
  SELECT "A" as Col1, "B" as Col2, "C" AS Col3
  UNION ALL
  SELECT "D" as Col1, "E" as Col2, "F" AS Col3
),

table2 AS (
  SELECT "G" as Col1, "H" as Col2, "J" AS Col3, "K" AS Col4, "L" AS Col5
  UNION ALL
  SELECT "M" as Col1, "N" as Col2, "O" AS Col3, "P" AS Col4, "Q" AS Col5
)

A simple UNION ALL doesn't work because the tables have a different number of columns:

SELECT * FROM table1
UNION ALL
SELECT * FROM table2

Error: Queries in UNION ALL have mismatched column count; query 1 has 3 columns, query 2 has 5 columns at [17:1]

A wildcard table isn't a suitable approach either, because federated sources don't support it:

SELECT * FROM `table*`

Error: External tables cannot be queried through prefix

Of course this is sample data with only 3 to 5 columns; the real tables have 20 to 40 columns. So a solution where I have to explicitly SELECT field by field is not a feasible approach.

Is there a working way to combine these two tables?

#standardSQL
SELECT *, NULL AS Col5, NULL AS Col6 FROM table1
UNION ALL
SELECT * FROM table2

You can check this with your example:

#standardSQL
WITH table1 AS (
  SELECT "ID1" AS Col1, "A" AS Col2, "B" AS Col3, "C" AS Col4
  UNION ALL
  SELECT "ID2", "D", "E", "F"
),
table2 AS (
  SELECT "ID3" AS Col1, "G" AS Col2, "H" AS Col3, "J" AS Col4, "K" AS Col5, "L" AS Col6
  UNION ALL
  SELECT "ID4", "M", "N", "O", "P", "Q"
)
SELECT *, NULL AS Col5, NULL AS Col6 FROM table1
UNION ALL
SELECT * FROM table2
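
Keep in mind that UNION ALL pairs columns by position, not by name, so appending the NULL columns at the end only works when the missing columns are the last ones. In the spreadsheets from the question the first sheet lacks Col3 and Col4 instead. A minimal sketch for that layout follows; it assumes every column is a STRING and uses inline sample data in place of the federated tables, so the types and table names would need adjusting for the real sources:

```sql
#standardSQL
WITH table1 AS (
  SELECT "ID1" AS Col1, "A" AS Col2, "B" AS Col5, "C" AS Col6
  UNION ALL
  SELECT "ID2", "D", "E", "F"
),
table2 AS (
  SELECT "ID3" AS Col1, "G" AS Col2, "H" AS Col3, "J" AS Col4, "K" AS Col5, "L" AS Col6
  UNION ALL
  SELECT "ID4", "M", "N", "O", "P", "Q"
)
-- The NULL placeholders sit in the positions where table2 has Col3 and Col4,
-- so that Col5 and Col6 of table1 line up with Col5 and Col6 of table2.
SELECT
  Col1,
  Col2,
  CAST(NULL AS STRING) AS Col3,  -- assumed STRING; use the real column type
  CAST(NULL AS STRING) AS Col4,  -- assumed STRING; use the real column type
  Col5,
  Col6
FROM table1
UNION ALL
SELECT Col1, Col2, Col3, Col4, Col5, Col6 FROM table2
ORDER BY Col1
```

Only the narrower table needs its columns spelled out here, which keeps the explicit list to one SELECT even when the real tables have 20 to 40 columns.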

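You can pass the rows through a UDF to handle the case where the column names aren't aligned by position or the number of columns differs between the tables. Here is an example: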

CREATE TEMP FUNCTION CoerceRow(json_row STRING)
RETURNS STRUCT<Col1 STRING, Col2 STRING, Col3 STRING, Col4 STRING, Col5 STRING>
LANGUAGE js AS """
return JSON.parse(json_row);
""";

WITH table1 AS (
  SELECT "A" as Col5, "B" as Col3, "C" AS Col2
  UNION ALL
  SELECT "D" as Col5, "E" as Col3, "F" AS Col2
),

table2 AS (
  SELECT "G" as Col1, "H" as Col2, "J" AS Col3, "K" AS Col4, "L" AS Col5
  UNION ALL
  SELECT "M" as Col1, "N" as Col2, "O" AS Col3, "P" AS Col4, "Q" AS Col5
)
SELECT CoerceRow(json_row).*
FROM (
  SELECT TO_JSON_STRING(t1) AS json_row
  FROM table1 AS t1
  UNION ALL
  SELECT TO_JSON_STRING(t2) AS json_row
  FROM table2 AS t2
);
+------+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 | Col5 |
+------+------+------+------+------+
| NULL | C    | B    | NULL | A    |
| NULL | F    | E    | NULL | D    |
| G    | H    | J    | K    | L    |
| M    | N    | O    | P    | Q    |
+------+------+------+------+------+

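Note that the CoerceRow function needs to declare the explicit row type that you want in the output. Beyond that, the columns of the tables being unioned are matched purely by name.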
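
Applied to the spreadsheet layout from the question, the same idea might look like the sketch below. It assumes every column is a STRING and uses inline sample data in place of the federated tables; only the struct declared in RETURNS has to list all six columns.

```sql
CREATE TEMP FUNCTION CoerceRow(json_row STRING)
RETURNS STRUCT<Col1 STRING, Col2 STRING, Col3 STRING, Col4 STRING, Col5 STRING, Col6 STRING>
LANGUAGE js AS """
  // Keys missing from the JSON object end up as NULL fields in the struct.
  return JSON.parse(json_row);
""";

WITH table1 AS (
  SELECT "ID1" AS Col1, "A" AS Col2, "B" AS Col5, "C" AS Col6
  UNION ALL
  SELECT "ID2", "D", "E", "F"
),
table2 AS (
  SELECT "ID3" AS Col1, "G" AS Col2, "H" AS Col3, "J" AS Col4, "K" AS Col5, "L" AS Col6
  UNION ALL
  SELECT "ID4", "M", "N", "O", "P", "Q"
)
-- Each row is serialized to JSON first, so both branches of the UNION ALL
-- have a single STRING column regardless of their original shape.
SELECT CoerceRow(json_row).*
FROM (
  SELECT TO_JSON_STRING(t1) AS json_row
  FROM table1 AS t1
  UNION ALL
  SELECT TO_JSON_STRING(t2) AS json_row
  FROM table2 AS t2
);
```

With this approach neither SELECT has to enumerate its columns; the column list lives only in the function's return type.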