Extract keys and values from a JSON string in BigQuery where there is no specified key in the JSON document

Question

I have a table in BigQuery with one row per object, and for each object I have some stringified JSON. Shown as JSON, a sample row looks like this:

{
    "ObjectID": "1984931229",
    "indexed_abstract": "{\"IndexLength\":123,\"InvertedIndex\":{\"Twenty-seven\":[0],\"metastatic\":[1,45],\"breast\":[2],\"adenocarcinoma\":[3],\"patients,\":[4]}}" 
}

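In indexed_abstract there is an InvertedIndex that holds some keywords and, for each keyword, an array of the positions where it appears in the ObjectID, so the length of the array is the number of times the keyword appears.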

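Now I want to parse the stringified JSON with BigQuery and, for each ObjectID, create a nested field that holds the keyword, the corresponding array, and the length of that array.

For example, in this case the output would look like this: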

+------------+----------------+---------------+-------------------+
|  ObjectID  |  keyword.key   | keyword.count | keyword.positions |
+------------+----------------+---------------+-------------------+
| 1984931229 | Twenty-seven   |             1 | [0]               |
|            | metastatic     |             2 | [1,45]            |
|            | breast         |             1 | [2]               |
|            | adenocarcinoma |             1 | [3]               |
|            | patients       |             1 | [4]               |
+------------+----------------+---------------+-------------------+

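I know I can use the JSON_EXTRACT function, but I don't know in advance what the keys in the inverted index are, so I can't name them in a JSON path to get at the keywords and their arrays.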

Answer 1

Below is for BigQuery Standard SQL

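#standardSQL
SELECT ObjectID,
  ARRAY(
    SELECT AS STRUCT
      key,
      -- number of comma-separated entries in the positions text, e.g. '[1,45]' gives 2
      ARRAY_LENGTH(SPLIT(value)) `count`,
      value positions
    -- pull every "key":[...] pair out of the InvertedIndex object as a string
    FROM UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(indexed_abstract, '$.InvertedIndex'), r'"[^"]+":\[[\d,]*?]')) pair,
    -- split each pair at ':' into the unquoted key and the positions text
    UNNEST([STRUCT(REPLACE(SPLIT(pair, ':')[OFFSET(0)], '"', '') AS key, SPLIT(pair, ':')[OFFSET(1)] AS value)])
  ) keyword
FROM `project.dataset.table`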

If applied to the sample data from your question, the result is

Row ObjectID    keyword.key     keyword.count   keyword.positions    
1   1984931229  Twenty-seven    1               [0]  
                metastatic      2               [1,45]   
                breast          1               [2]  
                adenocarcinoma  1               [3]  
                patients        1               [4]
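
To try this without a real table, the sample row from the question can be inlined in a WITH clause. The following is only a sketch: the CTE name sample_data is made up for illustration and stands in for `project.dataset.table`. It returns one flat row per keyword and also exposes the raw pair string, so the intermediate step of the regex approach is visible.

```sql
#standardSQL
-- Hypothetical inline sample standing in for `project.dataset.table`.
WITH sample_data AS (
  SELECT
    '1984931229' AS ObjectID,
    '{"IndexLength":123,"InvertedIndex":{"Twenty-seven":[0],"metastatic":[1,45],"breast":[2],"adenocarcinoma":[3],"patients,":[4]}}' AS indexed_abstract
)
SELECT
  ObjectID,
  pair,                                                  -- raw match, e.g. "metastatic":[1,45]
  REPLACE(SPLIT(pair, ':')[OFFSET(0)], '"', '') AS key,  -- text before the ':' with quotes removed
  SPLIT(pair, ':')[OFFSET(1)] AS positions               -- text after the ':', still a string
FROM sample_data,
  UNNEST(REGEXP_EXTRACT_ALL(
    JSON_EXTRACT(indexed_abstract, '$.InvertedIndex'),
    r'"[^"]+":\[[\d,]*?]')) AS pair
```

Keys are taken verbatim from the JSON text. Because each pair is split at ':', this sketch assumes that keywords contain neither a colon nor a double quote; a keyword with a colon would be cut at the first one.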

Update for OP's comment: "I was wondering, if I wanted to make the positions an array (a repeated field), how would I do that?"

That takes a change to just one line:

  SPLIT(REGEXP_REPLACE(value, r'\[|]', '')) positions
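
With that line, positions is a repeated field of strings. If numeric positions are wanted, one possible variation is to cast each element. This goes beyond what the OP asked for, so treat it as a sketch: the CTE sample_data is a hypothetical stand-in for `project.dataset.table`, and the aliases p and o are illustrative.

```sql
#standardSQL
-- Hypothetical inline sample standing in for `project.dataset.table`.
WITH sample_data AS (
  SELECT
    '1984931229' AS ObjectID,
    '{"IndexLength":123,"InvertedIndex":{"Twenty-seven":[0],"metastatic":[1,45],"breast":[2],"adenocarcinoma":[3],"patients,":[4]}}' AS indexed_abstract
)
SELECT ObjectID,
  ARRAY(
    SELECT AS STRUCT
      key,
      ARRAY_LENGTH(SPLIT(value)) `count`,
      -- strip the brackets, split on ',', and cast each piece to INT64,
      -- keeping the original order; '[1,45]' becomes [1, 45]
      ARRAY(
        SELECT SAFE_CAST(p AS INT64)
        FROM UNNEST(SPLIT(REGEXP_REPLACE(value, r'\[|]', ''))) p WITH OFFSET o
        WHERE p != ''   -- skip the empty piece produced by an empty list '[]'
        ORDER BY o
      ) positions
    FROM UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(indexed_abstract, '$.InvertedIndex'), r'"[^"]+":\[[\d,]*?]')) pair,
    UNNEST([STRUCT(REPLACE(SPLIT(pair, ':')[OFFSET(0)], '"', '') AS key, SPLIT(pair, ':')[OFFSET(1)] AS value)])
  ) keyword
FROM sample_data
```

The WHERE clause is there because BigQuery does not allow NULL elements in an array that is returned in a query result, and SAFE_CAST of an empty string yields NULL.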
