In Hive, how do I explode XML tags inside and outside subfamilies and map them correctly per family?

In the XML below, I need to explode the {Name, Value} pairs together with the ParentID tag and map them correctly for each "Parent" family:

<Parents>
    <Parent>
        <ParentID>12345</ParentID>
        <ParentArray>
            <ParentField>
                <Name>ABCD</Name>
                <Value>111</Value>
            </ParentField>
        </ParentArray>
    </Parent>
    <Parent>
        <ParentID>54321</ParentID>
        <ParentArray>
            <ParentField>
                <Name>ABCD</Name>
                <Value>111</Value>
            </ParentField>
            <ParentField>
                <Name>CDBA</Name>
                <Value>222</Value>
            </ParentField>
        </ParentArray>
    </Parent>
    <Parent>
        <ParentID>12534</ParentID>
        <ParentArray>
            <ParentField>
                <Name>ABCD</Name>
                <Value>111</Value>
            </ParentField>
            <ParentField>
                <Name>ABCD</Name>
                <Value>222</Value>
            </ParentField>
            <ParentField>
                <Name>CDBA</Name>
                <Value>333</Value>
            </ParentField>
        </ParentArray>
    </Parent>
    <Parent>
        <ParentID>51342</ParentID>
        <ParentArray>
            <ParentField>
                <Name>ABCD</Name>
                <Value>111</Value>
            </ParentField>
            <ParentField>
                <Name>ABCD</Name>
                <Value>222</Value>
            </ParentField>
            <ParentField>
                <Name>ABCD</Name>
                <Value>333</Value>
            </ParentField>
            <ParentField>
                <Name>CDBA</Name>
                <Value>444</Value>
            </ParentField>
        </ParentArray>
    </Parent>
</Parents>

Expected output:

ParentID    Name    Value
12345       ABCD    111
54321       ABCD    111
54321       CDBA    222
12534       ABCD    111
12534       ABCD    222
12534       CDBA    333
51342       ABCD    111
51342       ABCD    222
51342       ABCD    333
51342       CDBA    444

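Each Parent family contains one ParentID tag. Its ParentArray subfamily contains one or more ParentField subfamilies, each holding a {Name, Value} pair. Within each Parent family, the ParentID must be mapped correctly to its own {Name, Value} pairs.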

Combine XPath filtering by value with the element's position in the array. See the comments in the code:

with your_data as (
    select  '<Parents>
    <Parent>
        <ParentID>12345</ParentID>
        <ParentArray>
            <ParentField>
                <Name>ABCD</Name>
                <Value>111</Value>
            </ParentField>
        </ParentArray>
    </Parent>
    <Parent>
        <ParentID>54321</ParentID>
        <ParentArray>
            <ParentField>
                <Name>ABCD</Name>
                <Value>111</Value>
            </ParentField>
            <ParentField>
                <Name>CDBA</Name>
                <Value>222</Value>
            </ParentField>
        </ParentArray>
    </Parent>
    <Parent>
        <ParentID>12534</ParentID>
        <ParentArray>
            <ParentField>
                <Name>ABCD</Name>
                <Value>111</Value>
            </ParentField>
            <ParentField>
                <Name>ABCD</Name>
                <Value>222</Value>
            </ParentField>
            <ParentField>
                <Name>CDBA</Name>
                <Value>333</Value>
            </ParentField>
        </ParentArray>
    </Parent>
    <Parent>
        <ParentID>51342</ParentID>
        <ParentArray>
            <ParentField>
                <Name>ABCD</Name>
                <Value>111</Value>
            </ParentField>
            <ParentField>
                <Name>ABCD</Name>
                <Value>222</Value>
            </ParentField>
            <ParentField>
                <Name>ABCD</Name>
                <Value>333</Value>
            </ParentField>
            <ParentField>
                <Name>CDBA</Name>
                <Value>444</Value>
            </ParentField>
        </ParentArray>
    </Parent>
</Parents>
' as xmlinfo
)

select p.parentid,
       n.name,
       -- filter by ParentID, position and Name, then extract the scalar Value
       -- (posexplode positions are 0-based, XPath positions are 1-based, hence pos+1)
       XPATH_STRING(xmlinfo, concat('(((Parents/Parent)[ParentID="', p.parentid, '"])/ParentArray/ParentField[', n.pos+1, '])[Name="', n.name, '"]/Value/text()')) as value
  from your_data d
       -- one row per ParentID
       lateral view explode(XPATH(xmlinfo, 'Parents/Parent/ParentID/text()')) p as parentid
       -- filter by ParentID to get the array of Names with their positions inside ParentArray
       lateral view posexplode(XPATH(xmlinfo, concat('(Parents/Parent)[ParentID="', p.parentid, '"]/ParentArray/ParentField/Name/text()'))) n as pos, name
;

Result:

p.parentid  n.name  value
12345   ABCD    111
54321   ABCD    111
54321   CDBA    222
12534   ABCD    111
12534   ABCD    222
12534   CDBA    333
51342   ABCD    111
51342   ABCD    222
51342   ABCD    333
51342   CDBA    444
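
The query above selects each Parent by its ParentID value, so it relies on ParentID being unique within the document. As a possible variation, here is a minimal sketch that addresses both Parent and ParentField by position only. It assumes every Parent has exactly one ParentID and one ParentArray, and every ParentField has exactly one Name, so that the exploded positions line up with the XPath positions. The aliases ppos, pid and fpos and the shortened two-Parent sample are made up for illustration:

with your_data as (
    -- shortened sample: only the first two Parent families from the question
    select '<Parents><Parent><ParentID>12345</ParentID><ParentArray><ParentField><Name>ABCD</Name><Value>111</Value></ParentField></ParentArray></Parent><Parent><ParentID>54321</ParentID><ParentArray><ParentField><Name>ABCD</Name><Value>111</Value></ParentField><ParentField><Name>CDBA</Name><Value>222</Value></ParentField></ParentArray></Parent></Parents>' as xmlinfo
)

select p.pid as parentid,
       f.name,
       -- address the Value by Parent position and ParentField position (XPath positions are 1-based)
       XPATH_STRING(d.xmlinfo, concat('/Parents/Parent[', cast(p.ppos + 1 as string), ']/ParentArray/ParentField[', cast(f.fpos + 1 as string), ']/Value/text()')) as value
  from your_data d
       -- one row per Parent, keeping its 0-based position
       lateral view posexplode(XPATH(d.xmlinfo, '/Parents/Parent/ParentID/text()')) p as ppos, pid
       -- Names of the Parent at that position, with their 0-based positions inside ParentArray
       lateral view posexplode(XPATH(d.xmlinfo, concat('/Parents/Parent[', cast(p.ppos + 1 as string), ']/ParentArray/ParentField/Name/text()'))) f as fpos, name
;

Under those assumptions this should give the same three columns as the value-based query. If a Parent may lack a ParentID, or a ParentField may lack a Name, the positions no longer line up and the value-based filter above is the safer choice.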