In Hive, how do I explode XML tags inside and outside nested elements and map them correctly per parent element?
In the XML below, I need to explode the {Name, Value} pairs together with the ParentID tag and map them correctly within each Parent element:
<Parents>
<Parent>
<ParentID>12345</ParentID>
<ParentArray>
<ParentField>
<Name>ABCD</Name>
<Value>111</Value>
</ParentField>
</ParentArray>
</Parent>
<Parent>
<ParentID>54321</ParentID>
<ParentArray>
<ParentField>
<Name>ABCD</Name>
<Value>111</Value>
</ParentField>
<ParentField>
<Name>CDBA</Name>
<Value>222</Value>
</ParentField>
</ParentArray>
</Parent>
<Parent>
<ParentID>12534</ParentID>
<ParentArray>
<ParentField>
<Name>ABCD</Name>
<Value>111</Value>
</ParentField>
<ParentField>
<Name>ABCD</Name>
<Value>222</Value>
</ParentField>
<ParentField>
<Name>CDBA</Name>
<Value>333</Value>
</ParentField>
</ParentArray>
</Parent>
<Parent>
<ParentID>51342</ParentID>
<ParentArray>
<ParentField>
<Name>ABCD</Name>
<Value>111</Value>
</ParentField>
<ParentField>
<Name>ABCD</Name>
<Value>222</Value>
</ParentField>
<ParentField>
<Name>ABCD</Name>
<Value>333</Value>
</ParentField>
<ParentField>
<Name>CDBA</Name>
<Value>444</Value>
</ParentField>
</ParentArray>
</Parent>
</Parents>
Expected output:
ParentID Name Value
12345 ABCD 111
54321 ABCD 111
54321 CDBA 222
12534 ABCD 111
12534 ABCD 222
12534 CDBA 333
51342 ABCD 111
51342 ABCD 222
51342 ABCD 333
51342 CDBA 444
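Each Parent element contains one ParentID tag. Its ParentArray child contains one or more ParentField elements, each holding a {Name, Value} pair. Within each Parent, the ParentID must be mapped to its own {Name, Value} pairs.
Combine XPath filtering by value with filtering by position in the array. See the comments in the code: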
with your_data as (
select '<Parents>
<Parent>
<ParentID>12345</ParentID>
<ParentArray>
<ParentField>
<Name>ABCD</Name>
<Value>111</Value>
</ParentField>
</ParentArray>
</Parent>
<Parent>
<ParentID>54321</ParentID>
<ParentArray>
<ParentField>
<Name>ABCD</Name>
<Value>111</Value>
</ParentField>
<ParentField>
<Name>CDBA</Name>
<Value>222</Value>
</ParentField>
</ParentArray>
</Parent>
<Parent>
<ParentID>12534</ParentID>
<ParentArray>
<ParentField>
<Name>ABCD</Name>
<Value>111</Value>
</ParentField>
<ParentField>
<Name>ABCD</Name>
<Value>222</Value>
</ParentField>
<ParentField>
<Name>CDBA</Name>
<Value>333</Value>
</ParentField>
</ParentArray>
</Parent>
<Parent>
<ParentID>51342</ParentID>
<ParentArray>
<ParentField>
<Name>ABCD</Name>
<Value>111</Value>
</ParentField>
<ParentField>
<Name>ABCD</Name>
<Value>222</Value>
</ParentField>
<ParentField>
<Name>ABCD</Name>
<Value>333</Value>
</ParentField>
<ParentField>
<Name>CDBA</Name>
<Value>444</Value>
</ParentField>
</ParentArray>
</Parent>
</Parents>
' as xmlinfo
)
select p.parentid, n.name, -- n.pos+1,
--filter by parentid, name and position and extract scalar
XPATH_STRING(xmlinfo,concat('(((Parents/Parent)[ParentID="',p.parentid,'"])/ParentArray/ParentField[',n.pos+1,'])[Name="',n.name,'"]/Value/text()')) as value
from your_data d
lateral view explode(XPATH(xmlinfo, 'Parents/Parent/ParentID/text()')) p as parentid
--filter by ParentID to get the array of Name values, with their positions inside ParentArray
lateral view posexplode(XPATH(xmlinfo, concat('(Parents/Parent)[ParentID="',p.parentid,'"]/ParentArray/ParentField/Name/text()'))) n as pos, name
;
Result:
p.parentid n.name value
12345 ABCD 111
54321 ABCD 111
54321 CDBA 222
12534 ABCD 111
12534 ABCD 222
12534 CDBA 333
51342 ABCD 111
51342 ABCD 222
51342 ABCD 333
51342 CDBA 444
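As a cross-check outside Hive, the same per-parent, per-position pairing can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is only an illustration of the mapping, not part of the Hive solution: the helper name `explode_parents` is hypothetical, the sample is a shortened subset of the XML above, and it assumes every Parent has exactly one ParentID and every ParentField has one Name and one Value.

```python
import xml.etree.ElementTree as ET

# Shortened sample with the same shape as the XML above (two of the four Parents).
XML = """<Parents>
  <Parent>
    <ParentID>12345</ParentID>
    <ParentArray>
      <ParentField><Name>ABCD</Name><Value>111</Value></ParentField>
    </ParentArray>
  </Parent>
  <Parent>
    <ParentID>12534</ParentID>
    <ParentArray>
      <ParentField><Name>ABCD</Name><Value>111</Value></ParentField>
      <ParentField><Name>ABCD</Name><Value>222</Value></ParentField>
      <ParentField><Name>CDBA</Name><Value>333</Value></ParentField>
    </ParentArray>
  </Parent>
</Parents>"""


def explode_parents(xml_text):
    """Return (ParentID, Name, Value) tuples, one per ParentField, in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    for parent in root.findall("Parent"):
        # One ParentID per Parent is assumed.
        parent_id = parent.findtext("ParentID")
        # Walk the fields in order, so repeated Names keep their own Value.
        for field in parent.findall("ParentArray/ParentField"):
            rows.append((parent_id, field.findtext("Name"), field.findtext("Value")))
    return rows


for row in explode_parents(XML):
    print(*row)
```

Because the fields are visited in document order inside each Parent, a repeated Name such as ABCD still pairs with its own Value. The Hive query gets the same effect by adding `n.pos+1` to the XPath predicate.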