BigQuery Pivot Data Rows Columns

I currently process data in BigQuery and then export it to Excel to build the final pivot table. I would like to produce the same result directly in BigQuery, using its PIVOT option.

My dataset in BigQuery looks like this:

Transaction_Month || ConsumerId || CUST_createdMonth
01/01/2015        || 1          || 01/01/2015
01/01/2015        || 1          || 01/01/2015
01/02/2015        || 1          || 01/01/2015
01/01/2015        || 2          || 01/01/2015
01/02/2015        || 3          || 01/02/2015
01/02/2015        || 4          || 01/02/2015
01/02/2015        || 5          || 01/02/2015
01/03/2015        || 5          || 01/02/2015
01/03/2015        || 6          || 01/03/2015
01/04/2015        || 6          || 01/03/2015
01/06/2015        || 6          || 01/03/2015
01/03/2015        || 7          || 01/03/2015
01/04/2015        || 8          || 01/04/2015
01/05/2015        || 8          || 01/04/2015
01/04/2015        || 9          || 01/04/2015

It is essentially an orders table with customer information attached.

When I bring this data into Excel, I add it to a pivot table with CUST_createdMonth as rows, Transaction_Month as columns, and a distinct count of ConsumerId as the values.

The output looks like this:

Is this kind of transformation possible in BigQuery?

There is no nice way to do this in BigQuery, but you can follow the idea below.

Step 1

Run the query below:

SELECT 'SELECT CUST_createdMonth, ' + 
   GROUP_CONCAT_UNQUOTED(
      'EXACT_COUNT_DISTINCT(IF(Transaction_Month = "' + Transaction_Month + '", ConsumerId, NULL)) as [m_' + REPLACE(Transaction_Month, '/', '_') + ']'
   ) 
   + ' FROM yourTable GROUP BY CUST_createdMonth ORDER BY CUST_createdMonth'
FROM (
  SELECT Transaction_Month 
  FROM yourTable
  GROUP BY Transaction_Month
  ORDER BY Transaction_Month
) 
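The same query-generation idea can be sketched in today's Standard SQL with STRING_AGG. This is a sketch under the same assumptions (a table named yourTable with the three columns shown above); it emits COUNT(DISTINCT ...) instead of the legacy EXACT_COUNT_DISTINCT:

```sql
-- Generates the pivot query text; STRING_AGG joins the per-month
-- expressions with its default ',' delimiter
SELECT CONCAT(
  'SELECT CUST_createdMonth, ',
  STRING_AGG(
    CONCAT(
      'COUNT(DISTINCT IF(Transaction_Month = "', Transaction_Month,
      '", ConsumerId, NULL)) AS m_', REPLACE(Transaction_Month, '/', '_')
    )
    ORDER BY Transaction_Month
  ),
  ' FROM yourTable GROUP BY CUST_createdMonth ORDER BY CUST_createdMonth'
)
FROM (
  SELECT DISTINCT Transaction_Month
  FROM yourTable
)
```

As with the legacy version, the output of this query is itself a query string: copy it out and run it as Step 2.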

As a result you will get a string like the one below (formatted here for readability):

SELECT
  CUST_createdMonth,
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/01/2015", ConsumerId, NULL)) AS [m_01_01_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/02/2015", ConsumerId, NULL)) AS [m_01_02_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/03/2015", ConsumerId, NULL)) AS [m_01_03_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/04/2015", ConsumerId, NULL)) AS [m_01_04_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/05/2015", ConsumerId, NULL)) AS [m_01_05_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/06/2015", ConsumerId, NULL)) AS [m_01_06_2015]
  FROM yourTable 
GROUP BY
  CUST_createdMonth
ORDER BY
  CUST_createdMonth
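For comparison, a Standard SQL equivalent of this generated query would replace EXACT_COUNT_DISTINCT with COUNT(DISTINCT ...) and drop the [bracketed] aliases; a sketch, truncated to the first two months:

```sql
SELECT
  CUST_createdMonth,
  -- COUNT(DISTINCT ...) ignores the NULLs produced for other months
  COUNT(DISTINCT IF(Transaction_Month = '01/01/2015', ConsumerId, NULL)) AS m_01_01_2015,
  COUNT(DISTINCT IF(Transaction_Month = '01/02/2015', ConsumerId, NULL)) AS m_01_02_2015
  -- ...one such expression per remaining month...
FROM yourTable
GROUP BY CUST_createdMonth
ORDER BY CUST_createdMonth
```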

Step 2

Just run the query composed above.

The result will look like this:

CUST_createdMonth   m_01_01_2015    m_01_02_2015    m_01_03_2015    m_01_04_2015    m_01_05_2015    m_01_06_2015     
01/01/2015          2               1               0               0               0               0    
01/02/2015          0               3               1               0               0               0    
01/03/2015          0               0               2               1               0               1    
01/04/2015          0               0               0               2               1               0   

Note

Step 1 is helpful when you have so many months that writing the query by hand would be too much manual work.
In that case Step 1 generates the query for you.

You can see more about pivoting in my other posts.


Please note – there is a limit of 10K columns per table, so you are limited to 10K pivoted columns.
You can also see a simplified example below (in case the above looks too complex/verbose):

How to create dummy variable columns for thousands of categories in Google BigQuery?

Mikhail actually has another approach for converting rows of an EAV-style schema into columns: use the logging tables and query the last CREATE TABLE entry to determine the latest table schema.

CREATE TEMP FUNCTION jsonSchemaStringToArray(jsonSchema STRING)
RETURNS ARRAY<STRING> AS ((
  SELECT
    SPLIT(
      REGEXP_REPLACE(
        REPLACE(LTRIM(jsonSchema, '{ '), '"fields": [', ''),
        r'{[^{]+"name": "([^\"]+)"[^}]+}[, ]*',
        r'\1,'  -- raw string so \1 reaches the regex engine as a backreference
      ),
      ','
    )
));

WITH valid_schema_columns AS (
  WITH array_output AS (
    SELECT
      jsonSchemaStringToArray(jsonSchema) AS column_names
    FROM (
      SELECT
        protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.schemaJson AS jsonSchema,
        ROW_NUMBER() OVER (ORDER BY metadata.timestamp DESC) AS record_count
      FROM `realself-main.bigquery_logging.cloudaudit_googleapis_com_data_access_20170101`
      WHERE
        protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.tableId = '<table_name>'
        AND protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.datasetId = '<schema_name>'
        AND protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.createDisposition = 'CREATE_IF_NEEDED'
    ) AS t
    WHERE
      t.record_count = 1 -- grab the latest entry
  )
  -- this is what actually UNNESTs the array into standard rows
  SELECT
    valid_column_name
  FROM array_output
  LEFT JOIN UNNEST(column_names) AS valid_column_name
)
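As a quick sanity check of the regex inside jsonSchemaStringToArray, the function can be called on a small literal schema string in the same script as the CREATE TEMP FUNCTION above. The JSON below is a made-up sample, not taken from real audit logs:

```sql
-- Run together with the CREATE TEMP FUNCTION statement above
SELECT jsonSchemaStringToArray(
  '{ "fields": [{ "name": "CUST_createdMonth", "type": "STRING" }, { "name": "ConsumerId", "type": "INTEGER" }] }'
) AS column_names
```

Note that, as written, the closing `] }` of the JSON survives as a trailing junk element of the array; the LEFT JOIN UNNEST step simply carries it through, so you may want to filter it out downstream.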