Extracting json field from string in Hive using dataset

I am trying a very basic Hive query. I want to extract a JSON field from a dataset, but I always get \N for the JSON field, while some_string comes back fine.

This is my query:

WITH dataset AS (
SELECT
CAST(
   '{ "traceId": "abc", "additionalData": "{\"Star Rating\":\"3\"}",  "locale": "en_US", "content": { "contentType": "PB", "content": "T S", "bP": { "mD": { "S R": "3" }, "cType": "T_S", "sType": "unknown-s", "bTimestamp": 0, "title": "T S" } }
    }' AS STRING) AS some_string
)
SELECT some_string, get_json_object(dataset.some_string, '$.traceId') FROM dataset

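Question: how do I get the JSON field here?

The problem is the backslashes. A single backslash is treated as an escape character for " and is removed by Hive, so the inner quotes of additionalData end up unescaped and the string is no longer valid JSON: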

hive> select '\"';
OK
"
Time taken: 0.069 seconds, Fetched: 1 row(s)

When you have two backslashes, Hive removes one of them:

hive> select '\\"';
OK
\"
Time taken: 0.061 seconds, Fetched: 1 row(s)

With two backslashes it works correctly:

WITH dataset AS (
  SELECT
  CAST(
     '{ "traceId": "abc", "additionalData": "{\\"Star Rating\\":\\"3\\"}",  "locale": "en_US", "content": { "contentType": "PB", "content": "T S", "bP": { "mD": { "S R": "3" }, "cType": "T_S", "sType": "unknown-s", "bTimestamp": 0, "title": "T S" } }
       }' AS STRING) AS some_string
   )
   SELECT some_string,  get_json_object(dataset.some_string, '$.traceId') FROM dataset;
OK
{ "traceId": "abc", "additionalData": "{\"Star Rating\":\"3\"}",  "locale": "en_US", "content": { "contentType": "PB", "content": "T S", "bP": { "mD": { "S R": "3" }, "cType": "T_S", "sType": "unknown-s", "bTimestamp": 0, "title": "T S" } }
    }   abc
Time taken: 0.788 seconds, Fetched: 1 row(s)
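
Once the string is valid JSON, other fields can be read the same way. The following is a minimal sketch rather than part of the original answer: it uses a trimmed-down copy of the JSON above, the column aliases are purely illustrative, and it assumes the `$.parent.child` path syntax that get_json_object accepts for nested objects.

```sql
-- Trimmed-down copy of the JSON above; note the doubled backslashes.
WITH dataset AS (
  SELECT '{ "traceId": "abc", "additionalData": "{\\"Star Rating\\":\\"3\\"}", "content": { "contentType": "PB" } }' AS some_string
)
SELECT
  get_json_object(some_string, '$.traceId')             AS trace_id,        -- top-level field
  get_json_object(some_string, '$.content.contentType') AS content_type,    -- nested field
  get_json_object(some_string, '$.additionalData')      AS additional_data  -- still a JSON-encoded string, not an object
FROM dataset;
```

Here additionalData is still a string that happens to contain JSON, so its inner keys cannot be reached with a single path expression.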

You can also easily remove the double quotes before { and after } in additionalData, which turns it into a nested JSON object:

WITH dataset AS (
SELECT
regexp_replace(regexp_replace(
   '{ "traceId": "abc", "additionalData": "{\"Star Rating\":\"3\"}",  "locale": "en_US", "content": { "contentType": "PB", "content": "T S", "bP": { "mD": { "S R": "3" }, "cType": "T_S", "sType": "unknown-s", "bTimestamp": 0, "title": "T S" } }
    }' ,'\\"\\{','\\{') ,'\\}\\"','\\}' )AS some_string
)
SELECT some_string,  get_json_object(dataset.some_string, '$.traceId') FROM dataset;

Returns:

OK
{ "traceId": "abc", "additionalData": {"Star Rating":"3"},  "locale": "en_US", "content": { "contentType": "PB", "content": "T S", "bP": { "mD": { "S R": "3" }, "cType": "T_S", "sType": "unknown-s", "bTimestamp": 0, "title": "T S" } }
    }   abc
Time taken: 7.035 seconds, Fetched: 1 row(s)