How to Pivot in Google BigQuery
Suppose I send the following query to BQ:
SELECT shipmentID, category, quantity
FROM [myDataset.myTable]
Further, suppose the query returns data such as:
shipmentID  category  quantity
1           shoes     5
1           hats      3
2           shirts    1
2           hats      2
3           toys      3
2           books     1
3           shirts    1
How can I pivot the results in BQ to produce output like this:
shipmentID  shoes  hats  shirts  toys  books
1           5      3     0       0     0
2           0      2     1       0     1
3           0      0     1       3     0
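As some additional background, I actually have 2000+ categories to pivot, and the volume of data is too large to pivot directly through a Pandas DataFrame in Python (it uses up all the memory, then slows to a crawl). I tried using a relational database, but ran into column limits, so I would like to do this directly in BQ, even if I have to build the query itself through Python. Any suggestions?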
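** Edit 1
I should mention that pivoting the data itself can be done in chunks, so that is not the problem. The real trouble comes afterwards, when trying to aggregate so that I have only one row per shipmentID. That is what eats up all the RAM.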
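To make the aggregation step concrete, here is a minimal Python sketch of folding chunks of (shipmentID, category, quantity) rows into one row per shipmentID, using the sample data above. The function name `aggregate_rows` and the two-chunk split are illustrative assumptions, and this only pins down the target result; it is not meant to solve the memory problem at 2000+ categories.

```python
from collections import defaultdict


def aggregate_rows(rows, totals=None):
    """Fold (shipmentID, category, quantity) rows into per-shipment sums.

    Passing the same `totals` back in lets chunks be processed one at a time.
    """
    if totals is None:
        totals = defaultdict(lambda: defaultdict(int))
    for shipment_id, category, quantity in rows:
        totals[shipment_id][category] += quantity
    return totals


# Sample rows from the question, processed as two hypothetical chunks.
rows = [
    (1, "shoes", 5), (1, "hats", 3), (2, "shirts", 1), (2, "hats", 2),
    (3, "toys", 3), (2, "books", 1), (3, "shirts", 1),
]
totals = aggregate_rows(rows[:4])
totals = aggregate_rows(rows[4:], totals)

# Lay the sparse totals out as one dense row per shipmentID.
categories = ["shoes", "hats", "shirts", "toys", "books"]
pivoted = {
    shipment_id: [sums.get(category, 0) for category in categories]
    for shipment_id, sums in sorted(totals.items())
}
```

Only the categories that actually occur for a shipment are stored until the final dense layout is built.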
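** Edit 2
After trying the accepted answer below, I found that using it to create a 2k+ column pivot table causes a "Resources exceeded" error. My BQ team was able to refactor the query, breaking it into smaller chunks so that it goes through. The basic structure of the query is as follows: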
SELECT
  SetA.*,
  SetB.*,
  SetC.*
FROM (
  SELECT
    shipmentID,
    SUM(IF (category="Rocks", quantity, 0)),
    SUM(IF (category="Paper", quantity, 0)),
    SUM(IF (category="Scissors", quantity, 0))
  FROM (
    SELECT
      a.shipmentID shipmentID,
      a.quantity quantity,
      a.category category
    FROM
      [myDataset.myTable] a)
  GROUP EACH BY
    shipmentID ) SetA
INNER JOIN EACH (
  SELECT
    shipmentID,
    SUM(IF (category="Jello Molds", quantity, 0)),
    SUM(IF (category="Torque Wrenches", quantity, 0))
  FROM (
    SELECT
      a.shipmentID shipmentID,
      a.quantity quantity,
      a.category category
    FROM
      [myDataset.myTable] a)
  GROUP EACH BY
    shipmentID ) SetB
ON
  SetA.shipmentID = SetB.shipmentID
INNER JOIN EACH (
  SELECT
    shipmentID,
    SUM(IF (category="Deep Thoughts", quantity, 0)),
    SUM(IF (category="Rainbows", quantity, 0)),
    SUM(IF (category="Ponies", quantity, 0))
  FROM (
    SELECT
      a.shipmentID shipmentID,
      a.quantity quantity,
      a.category category
    FROM
      [myDataset.myTable] a)
  GROUP EACH BY
    shipmentID ) SetC
ON
  SetB.shipmentID = SetC.shipmentID
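The above pattern can be continued indefinitely by adding INNER JOIN EACH segments one after another. For my application, BQ was able to handle roughly 500 columns per chunk.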
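Since the chunked query is too long to write by hand for 2000+ categories, it can be generated. The following is a rough Python sketch of one way to do that, assuming the categories are already available as a list of strings. The names `pivot_column` and `build_chunked_pivot_query`, the `Set0`, `Set1`, ... aliases, the column-alias rule, and the backslash escaping of quotes are all my own assumptions; the table name and the default of 500 columns per chunk come from the question.

```python
import re


def pivot_column(category):
    """Return one SUM(IF(...)) expression for a single category value."""
    # Assumption: backslash escaping is sufficient for the string literal.
    literal = category.replace("\\", "\\\\").replace("'", "\\'")
    # Assumption: the alias is the category with non-word characters
    # replaced by "_"; distinct categories could collide under this rule.
    alias = re.sub(r"[^0-9A-Za-z_]", "_", category)
    if not alias or alias[0].isdigit():
        alias = "_" + alias
    return f"SUM(IF(category = '{literal}', quantity, 0)) AS {alias}"


def build_chunked_pivot_query(categories, chunk_size=500,
                              table="[myDataset.myTable]"):
    """Build one subquery per chunk of categories and join them on shipmentID."""
    if not categories:
        raise ValueError("at least one category is required")
    chunks = [categories[i:i + chunk_size]
              for i in range(0, len(categories), chunk_size)]

    subqueries = []
    for index, chunk in enumerate(chunks):
        columns = ",\n".join("    " + pivot_column(c) for c in chunk)
        subqueries.append(
            "(\n"
            "  SELECT\n"
            "    shipmentID,\n"
            f"{columns}\n"
            f"  FROM {table}\n"
            "  GROUP EACH BY shipmentID\n"
            f") Set{index}"
        )

    # Chain the joins the same way as above: each set joins to the previous one.
    select_list = ",\n".join(f"  Set{i}.*" for i in range(len(chunks)))
    from_clause = subqueries[0]
    for i in range(1, len(subqueries)):
        from_clause += (
            f"\nINNER JOIN EACH {subqueries[i]}\n"
            f"ON Set{i - 1}.shipmentID = Set{i}.shipmentID"
        )
    return f"SELECT\n{select_list}\nFROM {from_clause}"
```

Calling `build_chunked_pivot_query(all_categories)` returns the query text as a string; how it is then submitted to BQ is left out here.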
Here is one way to do it:
select shipmentID,
  sum(IF (category='shoes', quantity, 0)) AS shoes,
  sum(IF (category='hats', quantity, 0)) AS hats,
  sum(IF (category='shirts', quantity, 0)) AS shirts,
  sum(IF (category='toys', quantity, 0)) AS toys,
  sum(IF (category='books', quantity, 0)) AS books
from
  (select 1 as shipmentID, 'shoes' as category, 5 as quantity),
  (select 1 as shipmentID, 'hats' as category, 3 as quantity),
  (select 2 as shipmentID, 'shirts' as category, 1 as quantity),
  (select 2 as shipmentID, 'hats' as category, 2 as quantity),
  (select 3 as shipmentID, 'toys' as category, 3 as quantity),
  (select 2 as shipmentID, 'books' as category, 1 as quantity),
  (select 3 as shipmentID, 'shirts' as category, 1 as quantity)
group by shipmentID
This returns:
+-----+------------+-------+------+--------+------+-------+
| Row | shipmentID | shoes | hats | shirts | toys | books |
+-----+------------+-------+------+--------+------+-------+
| 1   | 1          | 5     | 3    | 0      | 0    | 0     |
| 2   | 2          | 0     | 2    | 1      | 0    | 1     |
| 3   | 3          | 0     | 0    | 1      | 3    | 0     |
+-----+------------+-------+------+--------+------+-------+
See the manual for another pivot table example.
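Because the question mentions building the query through Python, here is a minimal sketch of generating the same SUM(IF(...)) pattern from a list of category values. The function name `build_pivot_query`, the alias rule, and the quote escaping are assumptions made for illustration; the table and column names are the ones from the question.

```python
import re


def build_pivot_query(categories, table="[myDataset.myTable]"):
    """Build a pivot query with one SUM(IF(...)) column per category."""
    if not categories:
        raise ValueError("at least one category is required")
    columns = []
    for category in categories:
        # Assumption: backslash escaping is sufficient for the string literal.
        literal = category.replace("\\", "\\\\").replace("'", "\\'")
        # Assumption: aliases are derived by replacing non-word characters
        # with "_"; distinct categories could collide under this rule.
        alias = re.sub(r"[^0-9A-Za-z_]", "_", category)
        if not alias or alias[0].isdigit():
            alias = "_" + alias
        columns.append(
            f"  SUM(IF(category = '{literal}', quantity, 0)) AS {alias}"
        )
    return (
        "SELECT\n"
        "  shipmentID,\n"
        + ",\n".join(columns)
        + f"\nFROM {table}\n"
        "GROUP BY shipmentID"
    )


# The five categories from the example above.
query = build_pivot_query(["shoes", "hats", "shirts", "toys", "books"])
```

The category list itself could come from a separate query over the distinct category values; that step is not shown.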