Google BigQuery inconsistent when variable names change in ORDER BY clause
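My goal is to test whether the grp values produced by one query are the same as the output of the same query. However, when I change a single variable name, I get different results.
Below is an example of two identical queries, for which we know the results should be the same. But if you run them together, you will find that one query produces different results from the other.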
SELECT grp
FROM
(
SELECT CONCAT(word, corpus) AS grp, rank1, rank2
FROM (
SELECT
word, corpus,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY test1 DESC) AS rank1,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
FROM
(
SELECT *, (word_count * word_count * corpus_date) AS test1
FROM [bigquery-public-data:samples.shakespeare]
)
)
)
WHERE rank1 <= 3 OR rank2 <= 3
HAVING grp NOT IN
(
SELECT grp FROM (
SELECT CONCAT(word, corpus) AS grp, rank1, rank2
FROM
(
SELECT
word, corpus,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY test2 DESC) AS rank1,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
FROM
(
SELECT *, (word_count * word_count * corpus_date) AS test2
FROM [bigquery-public-data:samples.shakespeare]
)
)
)
WHERE rank1 <= 3 OR rank2 <= 3
)
Worse... if you now run exactly the same query but only change the variable name test1 to test3, you get completely different results.
SELECT grp
FROM
(
SELECT CONCAT(word, corpus) AS grp, rank1, rank2
FROM (
SELECT
word, corpus,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY test3 DESC) AS rank1,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
FROM
(
SELECT *, (word_count * word_count * corpus_date) AS test3
FROM [bigquery-public-data:samples.shakespeare]
)
)
)
WHERE rank1 <= 3 OR rank2 <= 3
HAVING grp NOT IN
(
SELECT grp FROM (
SELECT CONCAT(word, corpus) AS grp, rank1, rank2
FROM
(
SELECT
word, corpus,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY test2 DESC) AS rank1,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
FROM
(
SELECT *, (word_count * word_count * corpus_date) AS test2
FROM [bigquery-public-data:samples.shakespeare]
)
)
)
WHERE rank1 <= 3 OR rank2 <= 3
)
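I cannot think of any explanation that accounts for both of these odd behaviors, and it is preventing me from validating my data. Any ideas?
EDIT:
I have updated the BigQuery SQL as suggested in the responses, but the same inconsistency still occurs.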
I don't understand the question. Both SQL syntax in general and BigQuery in particular are quite clear: an alias defined in a SELECT cannot be used by other expressions in the same SELECT. As the BigQuery documentation states:
Aliases defined in a SELECT clause can be referenced in the GROUP BY, HAVING, and ORDER BY clauses of the query, but not by the FROM, WHERE, or OMIT RECORD IF clauses nor by other expressions in the same SELECT clause. [emphasis mine]
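So your query would only be valid if test1, test2, and test3 were columns in the shakespeare table. There is no reason to think those columns would hold similar values, so I would not expect the queries to return the same results.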
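EDIT:
If we assume the documentation is incorrect, then the problem is probably duplicate values in the order by condition of row_number(). Sorting in SQL is not stable, which means that two rows with the same sort key value can appear in any order in the sorted output. Even the same query can return different results on two runs. SQL sorts cannot be stable, because tables have no inherent ordering among their rows (ordering is specified only by columns).
So all that is happening is that different rows with the same sort key value are being chosen. I don't think it has anything to do with the aliases.
How do you fix this? Add an additional sort key, such as an id, as the final key in the ordering. Or use rank() or dense_rank() and decide explicitly how to handle the duplicates.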
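To make the point about ties concrete, here is a minimal Python sketch (not BigQuery code). The sample rows and the helpers `row_number` and `rank` are hypothetical and only mimic ROW_NUMBER() and RANK() over a single partition. Python's sort is stable, so the arrival order of the rows stands in for whatever order the engine happens to pick among ties; a SQL engine gives no such guarantee.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, int]  # (corpus, sort_value); hypothetical sample data


def by_value(row: Row) -> tuple:
    """Sort key for ORDER BY value DESC (no tie-breaker)."""
    return (-row[1],)


def by_value_then_corpus(row: Row) -> tuple:
    """Sort key for ORDER BY value DESC, corpus (unique tie-breaker)."""
    return (-row[1], row[0])


def row_number(rows: List[Row], key: Callable[[Row], tuple]) -> Dict[str, int]:
    """Mimic ROW_NUMBER() over one partition: number rows 1..n in key order."""
    ordered = sorted(rows, key=key)
    return {corpus: i for i, (corpus, _) in enumerate(ordered, start=1)}


def rank(rows: List[Row]) -> Dict[str, int]:
    """Mimic RANK() ... ORDER BY value DESC: tied rows share the same rank."""
    values = sorted((value for _, value in rows), reverse=True)
    return {corpus: values.index(value) + 1 for corpus, value in rows}


# The same rows in two arrival orders; "hamlet" and "othello" tie on the value.
run_a = [("hamlet", 100), ("othello", 100), ("macbeth", 50)]
run_b = [("othello", 100), ("hamlet", 100), ("macbeth", 50)]

# Without a tie-breaker, the numbering of the tied rows depends on arrival order.
print(row_number(run_a, by_value))
print(row_number(run_b, by_value))

# With a unique final sort key, both runs produce the same numbering.
print(row_number(run_a, by_value_then_corpus) == row_number(run_b, by_value_then_corpus))

# RANK() sidesteps the issue by giving tied rows the same rank.
print(rank(run_a))
```

With `rank`, a filter such as "rank <= 3" can keep more than three rows per partition when there are ties, so you still have to decide how to handle the duplicates.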
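I've noticed that you always ask tough questions and then have a hard time accepting or even voting on answers.
That's OK! I want to try again, so let's get to the topic: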
Using an alias within the same SELECT statement appears to be undocumented and unsupported.
The note in the SELECT clause documentation reads:
Each expression can be given an alias by adding a space followed by an identifier after the expression. The optional AS keyword can be added between the expression and the alias for improved readability. Aliases defined in a SELECT clause can be referenced in the GROUP BY, HAVING, and ORDER BY clauses of the query, but not by the FROM, WHERE, or OMIT RECORD IF clauses nor by other expressions in the same SELECT clause.
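So what you have here is odd behavior with no error thrown.
You can use it at your own risk, but it is better not to. (I would still be glad to hear from the Google team, but since this is unsupported you cannot expect much of an explanation for the behavior.)
Meanwhile, I recommend sticking to what is supported and transforming your query into the "stable" version below.
It does not have the problem you originally ran into!
(Note that I changed the WHERE clause in the first subquery; otherwise it always returns zero rows, which makes total sense.)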
SELECT grp
FROM
(
SELECT CONCAT(word, corpus) AS grp, rank2,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY [try_any_alias_1] DESC) AS rank1
FROM (
SELECT
word, corpus,
(word_count * word_count * corpus_date) AS [try_any_alias_1],
ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
FROM [bigquery-public-data:samples.shakespeare]
)
)
WHERE rank1 <= 3 OR rank2 <= 4 // if rank2 <= 3 as in second subquery - result is always empty as expected
HAVING grp NOT IN
(
SELECT grp FROM (
SELECT CONCAT(word, corpus) AS grp, rank2,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY [try_any_alias_2] DESC) AS rank1
FROM
(
SELECT
word, corpus,
(word_count * word_count * corpus_date) AS [try_any_alias_2],
ROW_NUMBER() OVER (PARTITION BY word ORDER BY word_count DESC) AS rank2,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus DESC) AS rank3,
ROW_NUMBER() OVER (PARTITION BY word ORDER BY corpus_date DESC) AS rank4
FROM [bigquery-public-data:samples.shakespeare]
)
)
WHERE rank1 <= 3 OR rank2 <= 3
)
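The problem is non-determinism in the row numbering.
There are many cases in this table where (word_count * word_count * corpus_date) is the same for several corpora. So when you partition by word and order by test2, the order used to assign row numbers is not deterministic.
When you run the same subquery twice within the same top-level query, BigQuery actually executes that subquery twice, and because of the non-determinism it may produce different results between the two runs.
Changing the alias probably just causes your query to miss the cache, which leads to a different set of non-deterministic choices and a different amount of overlap between the results.
You can confirm this by changing the ORDER BY clause in the analytic functions to include corpus. For example, change ORDER BY test2 to ORDER BY test2, corpus. The row numbering will then be deterministic, and the query will return zero results no matter what alias you use.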
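The self-comparison described above can be simulated with a small Python sketch (not BigQuery code). The rows, the function `top_n_groups`, and the two "executions" are hypothetical. Reversing the input stands in for the engine choosing a different order among tied rows, and the sketch keeps only the rank1 condition, leaving out `OR rank2 <= 3` for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Row = Tuple[str, str, int]  # (word, corpus, test_value); hypothetical rows


def top_n_groups(rows: List[Row], n: int, tie_breaker: bool) -> Set[str]:
    """Return the word+corpus labels whose row number within the word is <= n."""
    partitions: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)  # PARTITION BY word

    kept: Set[str] = set()
    for part in partitions.values():
        if tie_breaker:
            # ORDER BY test_value DESC, corpus
            ordered = sorted(part, key=lambda r: (-r[2], r[1]))
        else:
            # ORDER BY test_value DESC; tied rows keep their arrival order
            ordered = sorted(part, key=lambda r: -r[2])
        kept.update(word + corpus for word, corpus, _ in ordered[:n])
    return kept


# Four corpora tie on the value for one word, but only three can be kept.
rows = [
    ("love", "hamlet", 9), ("love", "othello", 9),
    ("love", "macbeth", 9), ("love", "kinglear", 9),
]
first_execution = rows
second_execution = list(reversed(rows))  # same data, different arrival order

for tie_breaker in (False, True):
    outer = top_n_groups(first_execution, 3, tie_breaker)
    inner = top_n_groups(second_execution, 3, tie_breaker)
    # Equivalent of: grp NOT IN (SELECT grp FROM ...)
    print(tie_breaker, sorted(outer - inner))
```

Without the tie-breaker, the two executions keep different rows and the difference is non-empty. With corpus as the final sort key, both executions keep the same three rows and the difference is empty.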