Hive - Reformat data structure
I have a sample of Hive data:
Customer | xx_var | yy_var | branchflow |
---|---|---|---|
{"customer_no":"239230293892839892","acct":["2324325","23425345"]} | 23 | 3 | [{"acctno":"2324325","value":[1,2,3,4,5,6,6,6,4]},{"acctno":"23425345","value":[1,2,3,4,5,6,6,6,99,4]}] |
I want to transform it into this:
Customer_no | acct | xx_var | yy_var | branchflow |
---|---|---|---|---|
239230293892839892 | 2324325 | 23 | 3 | [1,2,3,4,5,6,6,6,4] |
239230293892839892 | 23425345 | 23 | 3 | [1,2,3,4,5,6,6,6,99,4] |
I tried the following query, but the output comes back in the wrong format.
SELECT
customer.customer_no,
acct,
xx_var,
yy_var,
bi_acctno,
values_bi
FROM
struct_test
LATERAL VIEW explode(customer.acct) acct AS acctno
LATERAL VIEW explode(branchflow.acctno) bia as bi_acctno
LATERAL VIEW explode(branchflow.value) biv as values_bi
WHERE bi_acctno = acctno
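Does anyone know how to solve this?
Use json_tuple to extract the JSON elements. For an array, json_tuple returns the value as a string as well, so remove the square brackets, split the string, and explode the result. See the comments in the demo code.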
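
To make those three steps concrete, here is a minimal sketch that handles only the `acct` array. It is an illustration under assumptions rather than part of the original answer: the one-row inline subquery stands in for the real table, and the aliases `c`, `a`, and `accts` are arbitrary. The backslashes in the pattern are doubled on the assumption that Hive unescapes string literals before compiling the regex.

```sql
-- Sketch: turn the JSON array stored under "acct" into one row per account.
select a.acct
from (
  -- hypothetical one-row input, copied from the sample in the question
  select '{"customer_no":"239230293892839892","acct":["2324325","23425345"]}' as customer
) t
-- step 1: json_tuple hands the array back as a plain string
lateral view json_tuple(t.customer, 'acct') c as accts
-- step 2: strip the leading [, the trailing ] and every double quote
-- step 3: split on commas and explode into rows
lateral view explode(split(regexp_replace(c.accts, '^\\[|"|\\]$', ''), ',')) a as acct;
```

For the sample row this should produce one row per account number. The full demo applies the same idea to both JSON columns.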
Demo:
with mytable as (--demo data, use your table instead of this CTE
select '{"customer_no":"239230293892839892","acct":["2324325","23425345"]}' as customer,
23 xx_var, 3 yy_var,
'[{"acctno":"2324325","value":[1,2,3,4,5,6,6,6,4]},{"acctno":"23425345","value":[1,2,3,4,5,6,6,6,99,4]}]' branchflow
)
select c.customer_no,
a.acct,
t.xx_var, t.yy_var,
get_json_object(b.acct_branchflow,'$.value') value
from mytable t
--extract customer_no and acct array
lateral view json_tuple(t.customer, 'customer_no', 'acct') c as customer_no, accts
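--remove [] and " and explode array of acct
--(backslashes are doubled because Hive unescapes string literals before the regex is compiled)
lateral view explode(split(regexp_replace(c.accts,'^\\[|"|\\]$',''),',')) a as acct
--remove [] and explode array of json, splitting only on commas that sit between } and {
lateral view explode(split(regexp_replace(t.branchflow,'^\\[|\\]$',''),'(?<=\\}),(?=\\{)')) b as acct_branchflow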
--this will remove duplicates after lateral view: need only matching acct
where get_json_object(b.acct_branchflow,'$.acctno') = a.acct
Result:
customer_no | acct | xx_var | yy_var | value |
---|---|---|---|---|
239230293892839892 | 2324325 | 23 | 3 | [1,2,3,4,5,6,6,6,4] |
239230293892839892 | 23425345 | 23 | 3 | [1,2,3,4,5,6,6,6,99,4] |
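
The query in the question accesses `customer.customer_no` and explodes `customer.acct` directly, which suggests the columns may be typed (a struct and an array of structs) rather than JSON strings. The question does not show the table definition, so this is only an assumption. If it holds, a possible variant is to explode `branchflow` once as whole structs, so each `acctno` stays paired with its own `value`. Exploding `acctno` and `value` in two separate lateral views would pair every account with every value array. The CTE below builds hypothetical typed data with `named_struct` and `array`; the aliases `a`, `b`, `bf`, and `acct_value` are illustrative.

```sql
-- Sketch for typed columns (assumed schema, not confirmed by the question):
--   customer   struct<customer_no:string, acct:array<string>>
--   branchflow array<struct<acctno:string, value:array<int>>>
with struct_test as (
  select named_struct('customer_no', '239230293892839892',
                      'acct', array('2324325', '23425345')) as customer,
         23 as xx_var,
         3  as yy_var,
         array(
           named_struct('acctno', '2324325',  'value', array(1,2,3,4,5,6,6,6,4)),
           named_struct('acctno', '23425345', 'value', array(1,2,3,4,5,6,6,6,99,4))
         ) as branchflow
)
select t.customer.customer_no,
       a.acctno   as acct,
       t.xx_var,
       t.yy_var,
       b.bf.value as acct_value
from struct_test t
-- one row per account number held in the customer struct
lateral view explode(t.customer.acct) a as acctno
-- one row per branchflow struct, keeping acctno and value together
lateral view explode(t.branchflow) b as bf
-- keep only the branchflow entry that belongs to the exploded account
where b.bf.acctno = a.acctno;
```

If the columns are in fact stored as JSON strings, the json_tuple approach from the demo is the one that applies.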