How to extract a repeated nested field from a JSON string and join it with an existing repeated nested field in BigQuery

I have a table with an article_id field, a nested repeated field author_names, and a string field extra_informations that contains a JSON string.

This is the schema of my table:

This is a sample row of the table:

[
  {
    "article_id": "2732930586",
    "author_names": [
      {"AuN": "h kanahashi", "AuId": "2591665239", "AfN": null, "AfId": null, "S": "1"},
      {"AuN": "t mukai", "AuId": "2607493793", "AfN": null, "AfId": null, "S": "2"},
      {"AuN": "y yamada", "AuId": "2606624579", "AfN": null, "AfId": null, "S": "3"},
      {"AuN": "k shimojima", "AuId": "2606600298", "AfN": null, "AfId": null, "S": "4"},
      {"AuN": "m mabuchi", "AuId": "2606138976", "AfN": null, "AfId": null, "S": "5"},
      {"AuN": "t aizawa", "AuId": "2723380540", "AfN": null, "AfId": null, "S": "6"},
      {"AuN": "k higashi", "AuId": "2725066679", "AfN": null, "AfId": null, "S": "7"}
    ],
    "extra_informations": "{
      \"DN\": \"Experimental study for improvement of crashworthiness in AZ91 magnesium foam controlling its microstructure.\",
      \"S\": [{\"Ty\":1,\"U\":\"https://shibaura.pure.elsevier.com/en/publications/experimental-study-for-improvement-of-crashworthiness-in-az91-mag\"}],
      \"VFN\": \"Materials Science and Engineering\",
      \"FP\": 283,
      \"LP\": 287,
      \"RP\": [{\"Id\":2024275625,\"CoC\":5},{\"Id\":2035451257,\"CoC\":5},{\"Id\":2141952446,\"CoC\":5},{\"Id\":2126566553,\"CoC\":6},{\"Id\":2089573897,\"CoC\":5},{\"Id\":2069241702,\"CoC\":7},{\"Id\":2000323790,\"CoC\":6},{\"Id\":1988924750,\"CoC\":16}],
      \"ANF\": [
        {\"FN\":\"H.\",\"LN\":\"Kanahashi\",\"S\":1},
        {\"FN\":\"T.\",\"LN\":\"Mukai\",\"S\":2},
        {\"FN\":\"Y.\",\"LN\":\"Yamada\",\"S\":3},
        {\"FN\":\"K.\",\"LN\":\"Shimojima\",\"S\":4},
        {\"FN\":\"M.\",\"LN\":\"Mabuchi\",\"S\":5},
        {\"FN\":\"T.\",\"LN\":\"Aizawa\",\"S\":6},
        {\"FN\":\"K.\",\"LN\":\"Higashi\",\"S\":7}
      ],
      \"BV\": \"Materials Science and Engineering\",
      \"BT\": \"a\"
    }"
  }
]

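In extra_informations.ANF, I have a nested array that contains more information about the author names.

The nested repeated author_names field has a sub-field author_names.S, which can be mapped to extra_informations.ANF.S for the join. Using this mapping, I am trying to produce the following table: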

| article_id | author_names.AuN | S | extra_informations.ANF.FN | extra_informations.ANF.LN |
| 2732930586 | h kanahashi      | 1 | H.                        | Kanahashi                 |
| 2732930586 | t mukai          | 2 | T.                        | Mukai                     |
| 2732930586 | y yamada         | 3 | Y.                        | Yamada                    |
| 2732930586 | k shimojima      | 4 | K.                        | Shimojima                 |
| 2732930586 | m mabuchi        | 5 | M.                        | Mabuchi                   |
| 2732930586 | t aizawa         | 6 | T.                        | Aizawa                    |
| 2732930586 | k higashi        | 7 | K.                        | Higashi                   |

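The main problem I am facing is that when I convert the JSON string using JSON_EXTRACT(extra_informations, "$.ANF"), it does not give me an array. Instead, it returns the nested repeated array as a string, which I cannot convert into an array.

Is it possible to produce such a table in BigQuery using standard SQL?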

Option 1

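This option is based on the REGEXP_REPLACE function plus a few other functions (REPLACE, SPLIT, etc.) to manipulate the result. Note: the extra manipulation is needed because JsonPath expressions in BigQuery do not support wildcards and filters.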

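#standardSQL
SELECT
  article_id, author.AuN, author.S,
  -- each `extra` element looks like: "FN":"H.","LN":"Kanahashi","S":1
  REPLACE(SPLIT(extra, '","')[OFFSET(0)], '"FN":"', '') FirstName,
  REPLACE(SPLIT(extra, '","')[OFFSET(1)], 'LN":"', '') LastName
FROM `table`, UNNEST(author_names) author
-- strip the outer [{ and }] from the ANF string, then split it into one string per object
LEFT JOIN UNNEST(SPLIT(REGEXP_REPLACE(JSON_EXTRACT(extra_informations, '$.ANF'), r'\[{|}\]', ''), '},{')) extra
-- match each author to the ANF entry with the same sequence number S
ON author.S = CAST(REPLACE(SPLIT(extra, '","')[OFFSET(2)], 'S":', '') AS INT64)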

Option 2

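To overcome the BigQuery "limitation" for JsonPath, you can use a custom function, as the example below shows.
Note: it uses jsonpath-0.8.0.js, which can be downloaded from https://code.google.com/archive/p/jsonpath/downloads and is assumed to have been uploaded to Google Cloud Storage at gs://your_bucket/jsonpath-0.8.0.js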

#standardSQL
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS STRING
LANGUAGE js AS """
    try { var parsed = JSON.parse(json);
        return jsonPath(parsed, json_path);
    } catch (e) { return null }
"""
OPTIONS (
    library="gs://your_bucket/jsonpath-0.8.0.js"
);
SELECT 
  article_id, author.AuN, author.S,
  CUSTOM_JSON_EXTRACT(extra_informations, CONCAT('$.ANF[?(@.S==', CAST(author.S AS STRING), ')].FN')) FirstName,
  CUSTOM_JSON_EXTRACT(extra_informations, CONCAT('$.ANF[?(@.S==', CAST(author.S AS STRING), ')].LN')) LastName
FROM `table`, UNNEST(author_names) author 

As you can see, you can now do all the magic in one simple JsonPath expression.
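The filter used in Option 2, $.ANF[?(@.S==N)].FN, can also be sketched in plain JavaScript with no external library. This is only a minimal sketch under stated assumptions: the helper name extractAnfField is hypothetical (it is not part of the answer above or of any library), and it assumes the JSON string always keeps the author list under the ANF key with a numeric S in each entry.

```javascript
// Hypothetical helper (not from the original answer): mimics the JsonPath
// filter $.ANF[?(@.S==s)].<field> using only built-in JavaScript.
function extractAnfField(json, s, field) {
  try {
    var parsed = JSON.parse(json);
    var entries = parsed.ANF || [];
    for (var i = 0; i < entries.length; i++) {
      // Compare as numbers so that "2" (string) and 2 (number) both match.
      if (Number(entries[i].S) === Number(s)) {
        var value = entries[i][field];
        return value === undefined ? null : String(value);
      }
    }
    return null; // no ANF entry with this S
  } catch (e) {
    return null; // malformed or missing JSON
  }
}

// Usage with a shortened copy of the sample row's extra_informations value.
var extra = '{"ANF":[{"FN":"H.","LN":"Kanahashi","S":1},{"FN":"T.","LN":"Mukai","S":2}]}';
console.log(extractAnfField(extra, 2, "FN")); // T.
console.log(extractAnfField(extra, 2, "LN")); // Mukai
```

A function body along these lines might replace the jsonPath call inside a LANGUAGE js function, which would remove the need for the OPTIONS (library=...) clause. Unlike the JsonPath version, it returns a single value, or null when no entry matches.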