BigQuery 数据透视表行列
BigQuery Pivot Data Rows Columns
我目前正在 BigQuery 中处理数据,然后导出到 Excel 以进行最终的 Pivot table,并希望能够使用 BigQuery 中的 PIVOT 选项创建相同的数据。
我在大查询中的数据集看起来像
Transaction_Month || ConsumerId || CUST_createdMonth
01/01/2015 || 1 || 01/01/2015
01/01/2015 || 1 || 01/01/2015
01/02/2015 || 1 || 01/01/2015
01/01/2015 || 2 || 01/01/2015
01/02/2015 || 3 || 01/02/2015
01/02/2015 || 4 || 01/02/2015
01/02/2015 || 5 || 01/02/2015
01/03/2015 || 5 || 01/02/2015
01/03/2015 || 6 || 01/03/2015
01/04/2015 || 6 || 01/03/2015
01/06/2015 || 6 || 01/03/2015
01/03/2015 || 7 || 01/03/2015
01/04/2015 || 8 || 01/04/2015
01/05/2015 || 8 || 01/04/2015
01/04/2015 || 9 || 01/04/2015
它本质上是一个附加了客户信息的订单table。
当我将此数据放入 excel 时,我将其添加到数据透视表 table,我将 CUST_createdMonth 添加为行,将 Transaction_Month 添加为列,然后值是 ConsumerID
的不同计数
输出如下
在 BigQuery 中可以进行这种转换吗?
在 BigQuery 中没有执行此操作的好方法,但您可以按照以下想法进行操作
Step 1
运行 下面查询
SELECT 'SELECT CUST_createdMonth, ' +
GROUP_CONCAT_UNQUOTED(
'EXACT_COUNT_DISTINCT(IF(Transaction_Month = "' + Transaction_Month + '", ConsumerId, NULL)) as [m_' + REPLACE(Transaction_Month, '/', '_') + ']'
)
+ ' FROM yourTable GROUP BY CUST_createdMonth ORDER BY CUST_createdMonth'
FROM (
SELECT Transaction_Month
FROM yourTable
GROUP BY Transaction_Month
ORDER BY Transaction_Month
)
结果 - 您将得到如下字符串(为了便于阅读,格式如下)
SELECT
CUST_createdMonth,
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/01/2015", ConsumerId, NULL)) AS [m_01_01_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/02/2015", ConsumerId, NULL)) AS [m_01_02_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/03/2015", ConsumerId, NULL)) AS [m_01_03_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/04/2015", ConsumerId, NULL)) AS [m_01_04_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/05/2015", ConsumerId, NULL)) AS [m_01_05_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/06/2015", ConsumerId, NULL)) AS [m_01_06_2015]
FROM yourTable
GROUP BY
CUST_createdMonth
ORDER BY
CUST_createdMonth
Step 2
只是 运行 以上组合查询
结果如下所示
CUST_createdMonth m_01_01_2015 m_01_02_2015 m_01_03_2015 m_01_04_2015 m_01_05_2015 m_01_06_2015
01/01/2015 2 1 0 0 0 0
01/02/2015 0 3 1 0 0 0
01/03/2015 0 0 2 1 0 1
01/04/2015 0 0 0 2 1 0
Note
如果您有很多个月的时间来处理太多的手动工作,那么第 1 步会很有帮助。
在这种情况下 - 第 1 步帮助您生成查询
You can see more about pivoting in my other posts.
请注意 – 每个 table 有 10K 列的限制 - 因此您只能使用 10K 个组织。
您还可以看到下面的简化示例(如果上面的例子也是 complex/verbose):
How to create dummy variable columns for thousands of categories in Google BigQuery?
实际上 Mikhail 还有另一种方法可以将 EAV 类型模式的行转换为列,方法是使用日志记录 tables 并查询最后一个 CREATE TABLE 条目以确定最新的 table 架构。
CREATE TEMP FUNCTION jsonSchemaStringToArray(jsonSchema String)
RETURNS ARRAY<STRING> AS ((
SELECT
SPLIT(
REGEXP_REPLACE(REPLACE(LTRIM(jsonSchema,'{ '),'"fields": [',''), r'{[^{]+"name": "([^\"]+)"[^}]+}[, ]*', '\1,')
,',')
));
WITH valid_schema_columns AS (
WITH array_output aS (SELECT
jsonSchemaStringToArray(jsonSchema) AS column_names
FROM (
SELECT
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.schemaJson AS jsonSchema
, ROW_NUMBER() OVER (ORDER BY metadata.timestamp DESC) AS record_count
FROM `realself-main.bigquery_logging.cloudaudit_googleapis_com_data_access_20170101`
WHERE
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.tableId = '<table_name>'
AND
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.datasetId = '<schema_name>'
AND
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.createDisposition = 'CREATE_IF_NEEDED'
) AS t
WHERE
t.record_count = 1 -- grab the latest entry
)
-- this is actually what UNNESTS the array into standard rows
SELECT
valid_column_name
FROM array_output
LEFT JOIN UNNEST(column_names) AS valid_column_name
)
我目前正在 BigQuery 中处理数据,然后导出到 Excel 以进行最终的 Pivot table,并希望能够使用 BigQuery 中的 PIVOT 选项创建相同的数据。
我在大查询中的数据集看起来像
Transaction_Month || ConsumerId || CUST_createdMonth
01/01/2015 || 1 || 01/01/2015
01/01/2015 || 1 || 01/01/2015
01/02/2015 || 1 || 01/01/2015
01/01/2015 || 2 || 01/01/2015
01/02/2015 || 3 || 01/02/2015
01/02/2015 || 4 || 01/02/2015
01/02/2015 || 5 || 01/02/2015
01/03/2015 || 5 || 01/02/2015
01/03/2015 || 6 || 01/03/2015
01/04/2015 || 6 || 01/03/2015
01/06/2015 || 6 || 01/03/2015
01/03/2015 || 7 || 01/03/2015
01/04/2015 || 8 || 01/04/2015
01/05/2015 || 8 || 01/04/2015
01/04/2015 || 9 || 01/04/2015
它本质上是一个附加了客户信息的订单table。
当我将此数据放入 excel 时,我将其添加到数据透视表 table,我将 CUST_createdMonth 添加为行,将 Transaction_Month 添加为列,然后值是 ConsumerID
的不同计数输出如下
在 BigQuery 中可以进行这种转换吗?
在 BigQuery 中没有执行此操作的好方法,但您可以按照以下想法进行操作
Step 1
运行 下面查询
SELECT 'SELECT CUST_createdMonth, ' +
GROUP_CONCAT_UNQUOTED(
'EXACT_COUNT_DISTINCT(IF(Transaction_Month = "' + Transaction_Month + '", ConsumerId, NULL)) as [m_' + REPLACE(Transaction_Month, '/', '_') + ']'
)
+ ' FROM yourTable GROUP BY CUST_createdMonth ORDER BY CUST_createdMonth'
FROM (
SELECT Transaction_Month
FROM yourTable
GROUP BY Transaction_Month
ORDER BY Transaction_Month
)
结果 - 您将得到如下字符串(为了便于阅读,格式如下)
SELECT
CUST_createdMonth,
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/01/2015", ConsumerId, NULL)) AS [m_01_01_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/02/2015", ConsumerId, NULL)) AS [m_01_02_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/03/2015", ConsumerId, NULL)) AS [m_01_03_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/04/2015", ConsumerId, NULL)) AS [m_01_04_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/05/2015", ConsumerId, NULL)) AS [m_01_05_2015],
EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/06/2015", ConsumerId, NULL)) AS [m_01_06_2015]
FROM yourTable
GROUP BY
CUST_createdMonth
ORDER BY
CUST_createdMonth
Step 2
只是 运行 以上组合查询
结果如下所示
CUST_createdMonth m_01_01_2015 m_01_02_2015 m_01_03_2015 m_01_04_2015 m_01_05_2015 m_01_06_2015
01/01/2015 2 1 0 0 0 0
01/02/2015 0 3 1 0 0 0
01/03/2015 0 0 2 1 0 1
01/04/2015 0 0 0 2 1 0
Note
如果您有很多个月的时间来处理太多的手动工作,那么第 1 步会很有帮助。
在这种情况下 - 第 1 步帮助您生成查询
You can see more about pivoting in my other posts.
请注意 – 每个 table 有 10K 列的限制 - 因此您只能使用 10K 个组织。
您还可以看到下面的简化示例(如果上面的例子也是 complex/verbose):
How to create dummy variable columns for thousands of categories in Google BigQuery?
实际上 Mikhail 还有另一种方法可以将 EAV 类型模式的行转换为列,方法是使用日志记录 tables 并查询最后一个 CREATE TABLE 条目以确定最新的 table 架构。
CREATE TEMP FUNCTION jsonSchemaStringToArray(jsonSchema String)
RETURNS ARRAY<STRING> AS ((
SELECT
SPLIT(
REGEXP_REPLACE(REPLACE(LTRIM(jsonSchema,'{ '),'"fields": [',''), r'{[^{]+"name": "([^\"]+)"[^}]+}[, ]*', '\1,')
,',')
));
WITH valid_schema_columns AS (
WITH array_output aS (SELECT
jsonSchemaStringToArray(jsonSchema) AS column_names
FROM (
SELECT
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.schemaJson AS jsonSchema
, ROW_NUMBER() OVER (ORDER BY metadata.timestamp DESC) AS record_count
FROM `realself-main.bigquery_logging.cloudaudit_googleapis_com_data_access_20170101`
WHERE
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.tableId = '<table_name>'
AND
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.datasetId = '<schema_name>'
AND
protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.createDisposition = 'CREATE_IF_NEEDED'
) AS t
WHERE
t.record_count = 1 -- grab the latest entry
)
-- this is actually what UNNESTS the array into standard rows
SELECT
valid_column_name
FROM array_output
LEFT JOIN UNNEST(column_names) AS valid_column_name
)