Flatten JSON string in BigQuery

I have a custom Airbyte job that fails to normalize the data, so I need to do it manually. The following data comes from our HR system:


{
  "title": "My Report",
  "fields": [{
      "id": "employeeNumber",
      "name": "Employee #"
    },
    {
      "id": "firstName",
      "name": "First Name"
    },
    {
      "id": "lastName",
      "name": "Last Name"
    }],
  "employees": [{
      "employeeNumber": "1234",
      "firstName": "Ann",
      "lastName": "Perkins"
    },
    {
      "employeeNumber": "5678",
      "firstName": "Bob",
      "lastName": "Builder"
    }]
}

My current BigQuery table looks like this (the JSON is stored as a string):

_airbyte_ab_id _airbyte_emitted_at _airbyte_data
123abc 2022-01-30 19:41:59 UTC {"title": "My Datawareouse", "fields": [ {"id": "employeeNumber", "name": "Employee_Number"}, {"id": "firstName", "name": "First_Name" }, { "id": "lastName", "name": "Last_Name"} ], "employees": [ { "employeeNumber": "1234", "firstName": "Ann", "lastName": "Perkins" }, { "employeeNumber": "5678", "firstName": "Bob", "lastName": "Builder" } ] }

I am trying to normalize the table so that it looks like this:

_airbyte_ab_id _airbyte_emitted_at Employee_Number First_Name Last_Name
123abc 2022-01-30 19:41:59 UTC 1234 Ann Perkins
123abc 2022-01-30 19:41:59 UTC 5678 Bob Builder

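How can I flatten the JSON into columns as in the example above, using SQL in BigQuery? (The script will be run from dbt, but for now I am just trying to get a valid query to run.)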

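I should add that the actual JSON has many more fields and they may change, and I expect things like "middle name" to be null. So in a perfect world, I would not have to define each column name, but have it run dynamically by reading the "fields" array.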

How to flatten the json into columns as the example above, using SQL in bigquery?

Consider the approach below:

select _airbyte_ab_id, _airbyte_emitted_at, 
  json_value(employee, '$.employeeNumber') employeeNumber,
  json_value(employee, '$.firstName') firstName,
  json_value(employee, '$.lastName') lastName
from your_table,
unnest(json_extract_array(_airbyte_data, '$.employees')) employee         

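Applied to the sample data in your question, this returns one row per employee, with the columns employeeNumber, firstName, and lastName next to _airbyte_ab_id and _airbyte_emitted_at.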
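The question also mentions fields such as a middle name that should come back as null. As a minimal sketch of that behavior: json_value returns NULL when the path does not exist in the JSON, so a column can be listed even if only some employees carry it. The inline your_table CTE and the middleName key are assumptions made for illustration; they stand in for the real table and for whatever optional field the HR system sends.

```sql
-- Assumption: a one-row stand-in for the real table, trimmed to a single employee.
with your_table as (
  select '123abc' as _airbyte_ab_id,
    timestamp '2022-01-30 19:41:59 UTC' as _airbyte_emitted_at,
    '{"employees": [{"employeeNumber": "1234", "firstName": "Ann", "lastName": "Perkins"}]}' as _airbyte_data
)
select _airbyte_ab_id, _airbyte_emitted_at,
  json_value(employee, '$.employeeNumber') employeeNumber,
  json_value(employee, '$.firstName') firstName,
  -- middleName is a hypothetical key that is absent from the sample, so this column is NULL
  json_value(employee, '$.middleName') middleName,
  json_value(employee, '$.lastName') lastName
from your_table,
unnest(json_extract_array(_airbyte_data, '$.employees')) employee
```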

... in a perfect world, I would not have to define each column name, but have it run dynamically by reading the "Fields" array

If your fields are defined dynamically and may even differ from row to row, I would recommend considering the flattening approach below:

select _airbyte_ab_id, _airbyte_emitted_at, 
  md5(employee) employee_hash,
  json_value(field, "$.id") key,
  regexp_extract(employee, r'"' || json_value(field, "$.id") || '":"(.*?)"') value
from your_table,
unnest(json_extract_array(_airbyte_data, '$.employees')) employee,
unnest(json_extract_array(_airbyte_data, '$.fields')) field       

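Applied to the sample data in your question, this returns one row per employee and field: key holds the field id (employeeNumber, firstName, lastName) and value holds the matching value for that employee.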
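If the key/value rows need to end up as real columns without hardcoding their names, one possible follow-up is to build the column list from the "fields" array and pivot with dynamic SQL. The following is only a sketch under stated assumptions, not a tested solution: it uses BigQuery scripting (declare, execute immediate) and the pivot operator, it assumes every field id is a valid column name that contains no single quote, and the names cols, field_id, and field_value are made up for this example. The temp table is a stand-in for the real table.

```sql
declare cols string;

-- Assumption: sample row standing in for the real table from the question.
create temp table your_table as
select '123abc' as _airbyte_ab_id,
  timestamp '2022-01-30 19:41:59 UTC' as _airbyte_emitted_at,
  '{"fields": [{"id": "employeeNumber"}, {"id": "firstName"}, {"id": "lastName"}], "employees": [{"employeeNumber": "1234", "firstName": "Ann", "lastName": "Perkins"}, {"employeeNumber": "5678", "firstName": "Bob", "lastName": "Builder"}]}' as _airbyte_data;

-- Collect the distinct field ids as a quoted, comma-separated list for the pivot clause.
set cols = (
  select string_agg(distinct "'" || json_value(field, '$.id') || "'", ', ')
  from your_table,
  unnest(json_extract_array(_airbyte_data, '$.fields')) field
);

-- Same flattening as in the answer above, then one column per field id.
execute immediate format("""
  select *
  from (
    select _airbyte_ab_id, _airbyte_emitted_at,
      md5(employee) employee_hash,
      json_value(field, '$.id') field_id,
      regexp_extract(employee, r'"' || json_value(field, '$.id') || '":"(.*?)"') field_value
    from your_table,
    unnest(json_extract_array(_airbyte_data, '$.employees')) employee,
    unnest(json_extract_array(_airbyte_data, '$.fields')) field
  )
  pivot (any_value(field_value) for field_id in (%s))
""", cols);
```

The pivot groups by the remaining columns (_airbyte_ab_id, _airbyte_emitted_at, employee_hash), so each employee becomes one row, and a field id that an employee lacks comes out as NULL. Since this relies on scripting, running it from dbt would likely need the column list to be built in a macro instead; that part is not covered here.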