BigQuery deduplication on two columns as unique key
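We use BigQuery religiously and have two tables that are essentially updated in parallel by different processes. The problem I have is that the tables have no unique identifier, and the goal is to combine the two tables with zero duplicates if possible. The unique identifier is a combination of two columns.
I have tried various MySQL-based queries, but none seem to work in BigQuery, so I am posting here for help. :)
Step 1. Copy the "clean" table into a new merged table.
Step 2. Query the "dirty" (old) table and insert any missing entries.
Query attempt 1: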
SELECT
COUNT(c.*)
FROM
[flash-student-96619:device_data.device_datav3_20160530] AS old
WHERE NOT EXISTS (
SELECT
1
FROM
[flash-student-96619:device_data_v7_merged.20160530] AS new
WHERE
new.dsn = old.dsn
AND new.timestamp = old.timestamp
)
Error: error at: 6.1 - 10.65. Only one query can be executed at a time.
Query attempt 2:
SELECT
*
FROM
[flash-student-96619:device_data.device_datav3_20160530]
WHERE
(dsn, timestamp) NOT IN (
SELECT
dsn,
timestamp
FROM
[flash-student-96619:device_data_v7_merged.20160530]
)
Error: Encountered " "," ", "" at line 6, column 7. Was expecting: ")" ...
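Honestly, I would be happy if I could do this in a single query. I need to take the data from both tables and produce a new one containing only unique rows.
Any help?
Something like the following should work: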
SELECT *
FROM (
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY dsn, timestamp) AS dup
  FROM
    [flash-student-96619:device_data.device_datav3_20160530],
    [flash-student-96619:device_data_v7_merged.20160530]
)
WHERE dup = 1
I recommend using an explicit field list instead of * in the outer SELECT, so that you can leave dup out of the actual output.
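For comparison, here is a minimal sketch of the same ROW_NUMBER() idea in standard SQL. It rests on a few assumptions: the two CTEs (OldData, NewData) are made-up sample data with a fixed base timestamp, standing in for the real tables, and both inputs are assumed to share the same column list. In legacy SQL a comma between tables in FROM means UNION ALL, but in standard SQL a comma is a cross join, so the union is written out explicitly. SELECT * EXCEPT (dup) is one way to drop the helper column without listing every field.

```sql
-- Hypothetical sample data standing in for the two real tables.
WITH OldData AS (
  SELECT
    x AS dsn,
    TIMESTAMP_ADD(TIMESTAMP '2016-05-30 00:00:00', INTERVAL x HOUR) AS timestamp
  FROM UNNEST([1, 2, 3, 4]) AS x),
NewData AS (
  SELECT
    x AS dsn,
    TIMESTAMP_ADD(TIMESTAMP '2016-05-30 00:00:00', INTERVAL x HOUR) AS timestamp
  FROM UNNEST([5, 2, 1, 6]) AS x)
-- Keep one row per (dsn, timestamp) pair and hide the helper column.
SELECT * EXCEPT (dup)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY dsn, timestamp) AS dup
  FROM (
    -- Explicit UNION ALL replaces the legacy SQL comma.
    SELECT * FROM OldData
    UNION ALL
    SELECT * FROM NewData
  )
)
WHERE dup = 1
ORDER BY dsn;
```

With this sample data the result has six rows, one for each dsn from 1 to 6. Because the window has no ORDER BY, which of two duplicate rows survives is arbitrary; that only matters if the rows differ in columns outside the key.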
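A bit late, but I want to point out that your original query works with only a minor modification if you use standard SQL (uncheck the "Use Legacy SQL" box under "Show Options"). I only had to change new to something else, since it is a reserved keyword. For example, this query works: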
WITH OldData AS (
  SELECT
    x AS dsn,
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL x HOUR) AS timestamp
  FROM UNNEST([1, 2, 3, 4]) AS x),
NewData AS (
  SELECT
    x AS dsn,
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL x HOUR) AS timestamp
  FROM UNNEST([5, 2, 1, 6]) AS x)
SELECT
  COUNT(*)
FROM OldData oldData
WHERE NOT EXISTS (
  SELECT 1
  FROM NewData newData
  WHERE
    newData.dsn = oldData.dsn
    AND newData.timestamp = oldData.timestamp
);
+-----+
| f0_ |
+-----+
|   2 |
+-----+
As for your second attempt, you can do this:
WITH OldData AS (
  SELECT
    x AS dsn,
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL x HOUR) AS timestamp
  FROM UNNEST([1, 2, 3, 4]) AS x),
NewData AS (
  SELECT
    x AS dsn,
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL x HOUR) AS timestamp
  FROM UNNEST([5, 2, 1, 6]) AS x)
SELECT
  *
FROM OldData
WHERE
  STRUCT(dsn, timestamp) NOT IN (
    SELECT AS STRUCT
      dsn,
      timestamp
    FROM NewData);
+-----+---------------------+
| dsn | timestamp           |
+-----+---------------------+
|   3 | 2016-07-21 11:54:08 |
|   4 | 2016-07-21 10:54:08 |
+-----+---------------------+
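Since the goal was to build the merged result in a single query, the NOT EXISTS filter above can be combined with UNION ALL. The following is only a sketch under stated assumptions: OldData and NewData are again made-up sample CTEs (with a fixed base timestamp so the output is repeatable), both have the same columns, and the "clean" side (NewData) contains no duplicates of its own.

```sql
-- Hypothetical sample data standing in for the two real tables.
WITH OldData AS (
  SELECT
    x AS dsn,
    TIMESTAMP_ADD(TIMESTAMP '2016-05-30 00:00:00', INTERVAL x HOUR) AS timestamp
  FROM UNNEST([1, 2, 3, 4]) AS x),
NewData AS (
  SELECT
    x AS dsn,
    TIMESTAMP_ADD(TIMESTAMP '2016-05-30 00:00:00', INTERVAL x HOUR) AS timestamp
  FROM UNNEST([5, 2, 1, 6]) AS x)
-- Step 1: take every row from the "clean" side.
SELECT *
FROM NewData
UNION ALL
-- Step 2: add only the old rows whose (dsn, timestamp) key is missing.
SELECT *
FROM OldData AS oldData
WHERE NOT EXISTS (
  SELECT 1
  FROM NewData AS newData
  WHERE
    newData.dsn = oldData.dsn
    AND newData.timestamp = oldData.timestamp
);
```

With this sample data the query returns six rows: the four from NewData plus dsn 3 and 4 from OldData. Against real tables, standard SQL expects backtick-quoted references in the form `project.dataset.table` rather than the legacy [project:dataset.table] form, and the result can be written out by setting a destination table for the query.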