当没有特定字段来参数化已解析的字符串时，如何从 XML 系列中解析字段并映射到其右父级？

Question

在我的输入XML中，Parents 中有多个“Parent”家庭。在每个 Parent 中，ChildRoot 中可以有一个或多个“ChildRoot”家族。在这个层级中有 4 个字段需要解析。在 Hive XPATH 函数的帮助下，如何在不发生交叉连接的情况下实现这一点？在这种情况下，没有特定字段来参数化已解析的字符串，以便映射到其右父级 (Parents/Parent/ChildRoots/ChildRoot/)。

<Parents>
    <Parent>
        <ChildRoots>
            <ChildRoot>
                <FieldA>Field A info1</FieldA>
                <FieldB>Field B info1</FieldB>
                <FieldC>Field C info1</FieldC>
                <FieldD>Field D info1</FieldD>
            </ChildRoot>
        </ChildRoots>
    </Parent>
    <Parent>
        <ChildRoots>
            <ChildRoot>
                <FieldA>Field A info2</FieldA>
                <FieldC>Field C info2</FieldC>
                <FieldD>Field D info2</FieldD>
            </ChildRoot>
            <ChildRoot>
                <FieldA>Field A info3</FieldA>
                <FieldB>Field B info3</FieldB>
                <FieldD>Field D info3</FieldD>
            </ChildRoot>
        </ChildRoots>
    </Parent>
</Parents>

预期输出：

FieldA          FieldB          FieldC          FieldD
Field A info1   Field B info1   Field C info1   Field D info1
Field A info2                   Field C info2   Field D info2
Field A info3   Field B info3                   Field D info3

Answer 1

需要的字段可以通过名称找到

xmllint --xpath '/Parents/Parent[1]/ChildRoots/ChildRoot[1]/child::*[name()="FieldA" or name()="FieldB" or name()="FieldC" or name()="FieldD"]' test.xml ; echo
# <FieldA>Field A info1</FieldA><FieldB>Field B info1</FieldB><FieldC>Field C info1</FieldC><FieldD>Field D info1</FieldD>

xmllint --xpath '/Parents/Parent[2]/ChildRoots/ChildRoot[1]/child::*[name()="FieldA" or name()="FieldB" or name()="FieldC" or name()="FieldD"]' test.xml ; echo
# <FieldA>Field A info2</FieldA><FieldC>Field C info2</FieldC><FieldD>Field D info2</FieldD>

xmllint --xpath '/Parents/Parent[2]/ChildRoots/ChildRoot[2]/child::*[name()="FieldA" or name()="FieldB" or name()="FieldC" or name()="FieldD"]' test.xml ; echo
# <FieldA>Field A info3</FieldA><FieldB>Field B info3</FieldB><FieldD>Field D info3</FieldD>

Answer 2

使用 count() XPATH 函数，您可以计算每一层的元素数量。例如 XPATH_INT(xml, 'count(/Parents/Parent)') 给出 2。使用横向视图 posexplode 为每个级别生成行并使用分解位置对 XPATH 进行参数化，因此所有这些都将具有一致的位置索引并且不会发生交叉连接。

在这个例子中<Parent>只有一个child <ChildRoots>，你可以去掉计数的计算，如果它可以是每个[=中唯一的<ChildRoots> 28=]，这会稍微简化代码。我的代码允许每个 <Parent>.

多个 <ChildRoots>

演示：

with mytable as (
select '<Parents>
    <Parent>
        <ChildRoots>
            <ChildRoot>
                <FieldA>Field A info1</FieldA>
                <FieldB>Field B info1</FieldB>
                <FieldC>Field C info1</FieldC>
                <FieldD>Field D info1</FieldD>
            </ChildRoot>
        </ChildRoots>
    </Parent>
    <Parent>
        <ChildRoots>
            <ChildRoot>
                <FieldA>Field A info2</FieldA>
                <FieldC>Field C info2</FieldC>
                <FieldD>Field D info2</FieldD>
            </ChildRoot>
            <ChildRoot>
                <FieldA>Field A info3</FieldA>
                <FieldB>Field B info3</FieldB>
                <FieldD>Field D info3</FieldD>
            </ChildRoot>
        </ChildRoots>
    </Parent>
</Parents>
' as xml
)

select --Get ChildRoot count and generate rows with position index, build final XPATH
       --xml, parent_pos, ChildRoots_pos, 
       --cr.pos+1 as ChildRoot_pos, --this expression used in xpath as is:
       XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldA/text()')) as FieldA,
       XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldB/text()')) as FieldB,
       XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldC/text()')) as FieldC,
       XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldD/text()')) as FieldD
from
( --Get ChildRoots count and generate rows with position index
select xml, parent_pos, crs.pos+1 as ChildRoots_pos
       --XPATH_INT(xml, concat('count(/Parents/Parent[',parent_pos,']/ChildRoots)')) as ChildRoots_cnt  
from
  ( --Get Parent count and generate rows with position index
    select  xml, p.pos+1 as parent_pos
          --XPATH_INT(xml, 'count(/Parents/Parent)') as Parent_cnt
     from mytable lateral view outer posexplode(split(space(XPATH_INT(xml, 'count(/Parents/Parent)')-1),'')) p as  pos, x
   ) p lateral view outer posexplode(split(space(XPATH_INT(xml, concat('count(/Parents/Parent[',parent_pos,']/ChildRoots)'))-1),'')) crs as  pos, x 
) crs lateral view outer posexplode(split(space(XPATH_INT(xml, concat('count(/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot)'))-1),'')) cr as  pos, x

结果：

fielda           fieldb           fieldc           fieldd
Field A info1    Field B info1    Field C info1    Field D info1
Field A info2                     Field C info2    Field D info2
Field A info3    Field B info3                     Field D info3

当没有特定字段来参数化已解析的字符串时，如何从 XML 系列中解析字段并映射到其右父级？

How to parse fields from within an XML family & map to its right parent when there's no specific field to parameterize the parsed strings?

xml

sql

xpath

hive

hiveql