In Hive, how do I explode identical child tags that appear multiple times under identical parent tags within an XML document?
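In the Hive query below, the XML consists of a Parents tag containing 4 Parent elements, each holding one ParentArray. Under each ParentArray there is one ParentFieldArray occurrence with the same Name and Value tags (ABCD and 111 respectively).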
with your_data as (
select '<Parents>
<Parent>
<ParentArray>
<ParentFieldArray>
<Name>ABCD</Name>
<Value>
<string>111</string>
</Value>
</ParentFieldArray>
</ParentArray>
</Parent>
<Parent>
<ParentArray>
<ParentFieldArray>
<Name>ABCD</Name>
<Value>
<string>111</string>
</Value>
</ParentFieldArray>
</ParentArray>
</Parent>
<Parent>
<ParentArray>
<ParentFieldArray>
<Name>ABCD</Name>
<Value>
<string>111</string>
</Value>
</ParentFieldArray>
</ParentArray>
</Parent>
<Parent>
<ParentArray>
<ParentFieldArray>
<Name>ABCD</Name>
<Value>
<string>111</string>
</Value>
</ParentFieldArray>
</ParentArray>
</Parent>
</Parents>' as xmlinfo
)
select name, pos+1 as pos, value
from your_data d
lateral view outer posexplode(XPATH(xmlinfo, 'Parents/Parent/ParentArray/ParentFieldArray/Name/text()')) pf as pos, Name
lateral view outer explode(XPATH(xmlinfo, concat('Parents/Parent/ParentArray/ParentFieldArray[',pf.pos+1, '][Name="', pf.Name, '"]/Value/string/text()'))) vl as value;
The query above puts all of the "111" rows under the first index and returns NULL values for indexes 2, 3 and 4.
Expected output of the query:
name pos value
ABCD 1 111
ABCD 2 111
ABCD 3 111
ABCD 4 111
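This comes from how XPath evaluates the expression. The [] predicate binds more tightly than /, so ParentFieldArray[2] means "the second ParentFieldArray inside its own ParentArray", not the second one in the whole document. Each ParentArray here contains only one ParentFieldArray, so index 1 matches all four nodes and indexes 2, 3 and 4 match nothing. Wrap the path in parentheses so the index applies to the full node set: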
with your_data as (
select '<Parents>
<Parent>
<ParentArray>
<ParentFieldArray>
<Name>ABCD</Name>
<Value>
<string>111</string>
</Value>
</ParentFieldArray>
</ParentArray>
</Parent>
<Parent>
<ParentArray>
<ParentFieldArray>
<Name>ABCD</Name>
<Value>
<string>111</string>
</Value>
</ParentFieldArray>
</ParentArray>
</Parent>
<Parent>
<ParentArray>
<ParentFieldArray>
<Name>ABCD</Name>
<Value>
<string>111</string>
</Value>
</ParentFieldArray>
</ParentArray>
</Parent>
<Parent>
<ParentArray>
<ParentFieldArray>
<Name>ABCD</Name>
<Value>
<string>111</string>
</Value>
</ParentFieldArray>
</ParentArray>
</Parent>
</Parents>' as xmlinfo
)
select pos+1 as pos, Name, Value
from your_data d
lateral view outer posexplode(XPATH(xmlinfo, 'Parents/Parent/ParentArray/ParentFieldArray/Name/text()')) pf as pos, Name
lateral view outer explode(XPATH(xmlinfo, concat('((Parents/Parent/ParentArray/ParentFieldArray)[',pf.pos+1, '])[Name="', pf.Name, '"]/Value/string/text()'))) vl as value
;
Result:
pos name value
1 ABCD 111
2 ABCD 111
3 ABCD 111
4 ABCD 111
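The precedence rule behind this fix can be reproduced outside Hive. The sketch below is only an illustration in Python, not part of the Hive solution: it uses the standard-library xml.etree.ElementTree module, which supports a limited XPath subset without parenthesized expressions, so indexing the full result list stands in for (path)[n]. The variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Same shape as the XML in the question: 4 Parent elements,
# each with one ParentArray holding one ParentFieldArray.
parent = (
    "<Parent><ParentArray><ParentFieldArray>"
    "<Name>ABCD</Name><Value><string>111</string></Value>"
    "</ParentFieldArray></ParentArray></Parent>"
)
root = ET.fromstring("<Parents>" + parent * 4 + "</Parents>")  # root is <Parents>

path = "Parent/ParentArray/ParentFieldArray"

# A positional predicate is evaluated per parent element:
# [1] matches the first ParentFieldArray in EACH ParentArray.
first_in_each = root.findall(path + "[1]")

# No ParentArray has a second ParentFieldArray, so [2] matches nothing.
second_in_each = root.findall(path + "[2]")

# Stand-in for (path)[2]: collect every match first, then index
# the combined list (Python lists are 0-based, XPath is 1-based).
all_nodes = root.findall(path)
second_overall = all_nodes[1]

print(len(first_in_each), len(second_in_each), len(all_nodes))
print(second_overall.findtext("Value/string"))
```

The first two counts mirror what the original query saw (everything under index 1, nothing under 2 to 4), while indexing the combined list mirrors the parenthesized expression used in the corrected query.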