BigQuery Pivot Data Rows Columns

I currently process data in BigQuery and then export it to Excel to build the final pivot table. I would like to produce the same result directly in BigQuery, using its PIVOT option.

My dataset in BigQuery looks like this:

Transaction_Month || ConsumerId || CUST_createdMonth
01/01/2015        || 1          || 01/01/2015
01/01/2015        || 1          || 01/01/2015
01/02/2015        || 1          || 01/01/2015
01/01/2015        || 2          || 01/01/2015
01/02/2015        || 3          || 01/02/2015
01/02/2015        || 4          || 01/02/2015
01/02/2015        || 5          || 01/02/2015
01/03/2015        || 5          || 01/02/2015
01/03/2015        || 6          || 01/03/2015
01/04/2015        || 6          || 01/03/2015
01/06/2015        || 6          || 01/03/2015
01/03/2015        || 7          || 01/03/2015
01/04/2015        || 8          || 01/04/2015
01/05/2015        || 8          || 01/04/2015
01/04/2015        || 9          || 01/04/2015

It is essentially an orders table with customer information attached.

When I bring this data into Excel, I add it to a pivot table with CUST_createdMonth as rows, Transaction_Month as columns, and a distinct count of ConsumerId as the values.

The output looks like this:

Is this kind of transformation possible in BigQuery?

There is no nice way to do this in BigQuery, but you can follow the idea below.

Step 1

Run the query below:

SELECT 'SELECT CUST_createdMonth, ' + 
   GROUP_CONCAT_UNQUOTED(
      'EXACT_COUNT_DISTINCT(IF(Transaction_Month = "' + Transaction_Month + '", ConsumerId, NULL)) as [m_' + REPLACE(Transaction_Month, '/', '_') + ']'
   ) 
   + ' FROM yourTable GROUP BY CUST_createdMonth ORDER BY CUST_createdMonth'
FROM (
  SELECT Transaction_Month 
  FROM yourTable
  GROUP BY Transaction_Month
  ORDER BY Transaction_Month
) 
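The same query-generation idea can be sketched in today's Standard SQL with STRING_AGG. This is a sketch under the same assumptions (a table named yourTable with the three columns shown above); it emits COUNT(DISTINCT ...) instead of the legacy EXACT_COUNT_DISTINCT:

```sql
-- Generates the pivot query text; STRING_AGG joins the per-month
-- expressions with its default ',' delimiter
SELECT CONCAT(
  'SELECT CUST_createdMonth, ',
  STRING_AGG(
    CONCAT(
      'COUNT(DISTINCT IF(Transaction_Month = "', Transaction_Month,
      '", ConsumerId, NULL)) AS m_', REPLACE(Transaction_Month, '/', '_')
    )
    ORDER BY Transaction_Month
  ),
  ' FROM yourTable GROUP BY CUST_createdMonth ORDER BY CUST_createdMonth'
)
FROM (
  SELECT DISTINCT Transaction_Month
  FROM yourTable
)
```

As with the legacy version, the output of this query is itself a query string: copy it out and run it as Step 2.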

As a result you will get a string like the one below (formatted here for readability):

SELECT
  CUST_createdMonth,
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/01/2015", ConsumerId, NULL)) AS [m_01_01_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/02/2015", ConsumerId, NULL)) AS [m_01_02_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/03/2015", ConsumerId, NULL)) AS [m_01_03_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/04/2015", ConsumerId, NULL)) AS [m_01_04_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/05/2015", ConsumerId, NULL)) AS [m_01_05_2015],
  EXACT_COUNT_DISTINCT(IF(Transaction_Month = "01/06/2015", ConsumerId, NULL)) AS [m_01_06_2015]
  FROM yourTable 
GROUP BY
  CUST_createdMonth
ORDER BY
  CUST_createdMonth
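For comparison, a Standard SQL equivalent of this generated query would replace EXACT_COUNT_DISTINCT with COUNT(DISTINCT ...) and drop the [bracketed] aliases; a sketch, truncated to the first two months:

```sql
SELECT
  CUST_createdMonth,
  -- COUNT(DISTINCT ...) ignores the NULLs produced for other months
  COUNT(DISTINCT IF(Transaction_Month = '01/01/2015', ConsumerId, NULL)) AS m_01_01_2015,
  COUNT(DISTINCT IF(Transaction_Month = '01/02/2015', ConsumerId, NULL)) AS m_01_02_2015
  -- ...one such expression per remaining month...
FROM yourTable
GROUP BY CUST_createdMonth
ORDER BY CUST_createdMonth
```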

Step 2

Just run the query composed above.

The result will look like this:

CUST_createdMonth   m_01_01_2015    m_01_02_2015    m_01_03_2015    m_01_04_2015    m_01_05_2015    m_01_06_2015     
01/01/2015          2               1               0               0               0               0    
01/02/2015          0               3               1               0               0               0    
01/03/2015          0               0               2               1               0               1    
01/04/2015          0               0               0               2               1               0   

Note

Step 1 is helpful when you have so many months that writing the query by hand would be too much manual work.
In that case Step 1 generates the query for you.

You can see more about pivoting in my other posts.


Please note – there is a limit of 10K columns per table, so you are limited to 10K pivoted columns.
You can also see a simplified example below (in case the above looks too complex/verbose):

How to create dummy variable columns for thousands of categories in Google BigQuery?

Mikhail actually has another approach for converting rows of an EAV-style schema into columns: use the logging tables and query the last CREATE TABLE entry to determine the latest table schema.

CREATE TEMP FUNCTION jsonSchemaStringToArray(jsonSchema STRING)
RETURNS ARRAY<STRING> AS ((
  SELECT
    SPLIT(
      REGEXP_REPLACE(
        REPLACE(LTRIM(jsonSchema, '{ '), '"fields": [', ''),
        r'{[^{]+"name": "([^\"]+)"[^}]+}[, ]*',
        r'\1,'  -- raw string so \1 reaches the regex engine as a backreference
      ),
      ','
    )
));

WITH valid_schema_columns AS (
  WITH array_output AS (
    SELECT
      jsonSchemaStringToArray(jsonSchema) AS column_names
    FROM (
      SELECT
        protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.schemaJson AS jsonSchema,
        ROW_NUMBER() OVER (ORDER BY metadata.timestamp DESC) AS record_count
      FROM `realself-main.bigquery_logging.cloudaudit_googleapis_com_data_access_20170101`
      WHERE
        protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.tableId = '<table_name>'
        AND protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.datasetId = '<schema_name>'
        AND protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.createDisposition = 'CREATE_IF_NEEDED'
    ) AS t
    WHERE
      t.record_count = 1 -- grab the latest entry
  )
  -- this is what actually UNNESTs the array into standard rows
  SELECT
    valid_column_name
  FROM array_output
  LEFT JOIN UNNEST(column_names) AS valid_column_name
)
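As a quick sanity check of the regex inside jsonSchemaStringToArray, the function can be called on a small literal schema string in the same script as the CREATE TEMP FUNCTION above. The JSON below is a made-up sample, not taken from real audit logs:

```sql
-- Run together with the CREATE TEMP FUNCTION statement above
SELECT jsonSchemaStringToArray(
  '{ "fields": [{ "name": "CUST_createdMonth", "type": "STRING" }, { "name": "ConsumerId", "type": "INTEGER" }] }'
) AS column_names
```

Note that, as written, the closing `] }` of the JSON survives as a trailing junk element of the array; the LEFT JOIN UNNEST step simply carries it through, so you may want to filter it out downstream.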