How can I pivot a dataset in Google BigQuery?

Question

I have a huge dataset with this schema:

Customer    INTEGER
CategoryID  INTEGER
CategoryName    STRING
ProjectStage    INTEGER
NextStepID  INTEGER
NextStepName    STRING
NextStepIsAnchor    BOOLEAN

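I would like to get a result set in which each customer appears on exactly one row, with his/her next steps in columns like this:

I tried using BigQuery's NTH function, but it only works for the first occurrence of NextStepID: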

SELECT 
customer, 
nth(1, NextStepID)
FROM [2015_05.customers_wunique_nextsteps] 
group by customer

But when I try to add more columns:

SELECT 
customer, 
nth(1, NextStepID),
nth(2, NextStepID)
FROM [2015_05.customers_wunique_nextsteps] 
group by customer

I get this error:

Error: Function 'NTH(2, [NextStepID])' cannot be used in a distributed query, this function can only be correctly computed for queries that run on a single node.

Any ideas? Right now I "pivot" the results using Excel and a small VBA script, but as the dataset grows, the computation time exceeds all limits...

Thanks in advance! :)

Answer 1

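The NTH function works on REPEATED fields, where it selects the nth repeated element (the error message could be improved). So the first step is to build a REPEATED field out of NextStepID, which can be done with the NEST aggregation function. Then you can use NTH as a scoped aggregation function: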

SELECT
  Customer,
  NTH(1, NextStepID) WITHIN RECORD AS NextStepID1,
  NTH(2, NextStepID) WITHIN RECORD AS NextStepID2,
  NTH(3, NextStepID) WITHIN RECORD AS NextStepID3
FROM (
  SELECT Customer, NEST(NextStepID) AS NextStepID
  FROM [2015_05.customers_wunique_nextsteps]
  GROUP BY Customer)
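
To make the semantics of that query concrete, here is a rough plain-Python sketch of what the NEST step and the scoped NTH calls compute. It is only an illustration, not BigQuery code: the sample rows and the helper names (nest, nth, pivot) are made up, and it assumes the nested values keep their input order, which NEST itself may not guarantee. As in legacy SQL, NTH counts from 1 and yields NULL (None here) when a customer has fewer than n values.

```python
from collections import defaultdict

# Hypothetical sample rows of (Customer, NextStepID); not from the real table.
rows = [(1, 10), (1, 20), (1, 30), (2, 40)]


def nest(rows):
    """Collect NextStepID values per customer, like NEST(...) GROUP BY Customer."""
    nested = defaultdict(list)
    for customer, next_step_id in rows:
        nested[customer].append(next_step_id)
    return nested


def nth(n, values):
    """1-based lookup like NTH(n, field); None stands in for NULL."""
    return values[n - 1] if 1 <= n <= len(values) else None


def pivot(rows, width=3):
    """One entry per customer with NextStepID1..NextStepID<width> as a list."""
    return {
        customer: [nth(i, steps) for i in range(1, width + 1)]
        for customer, steps in nest(rows).items()
    }


for customer, steps in pivot(rows).items():
    print(customer, steps)
```

A customer with only one next step ends up with None in the remaining columns, which mirrors the NULLs the query above would return for NextStepID2 and NextStepID3.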


sql

pivot

google-bigquery