当没有特定字段来参数化已解析的字符串时,如何从 XML 系列中解析字段并映射到其右父级?
How to parse fields from within an XML family & map to its right parent when there's no specific field to parameterize the parsed strings?
在我的输入XML中,Parents 中有多个“Parent”家庭。在每个 Parent 中,ChildRoot 中可以有一个或多个“ChildRoot”家族。在这个层级中有 4 个字段需要解析。在 Hive XPATH 函数的帮助下,如何在不发生交叉连接的情况下实现这一点?在这种情况下,没有特定字段来参数化已解析的字符串,以便映射到其右父级 (Parents/Parent/ChildRoots/ChildRoot/)。
<Parents>
<Parent>
<ChildRoots>
<ChildRoot>
<FieldA>Field A info1</FieldA>
<FieldB>Field B info1</FieldB>
<FieldC>Field C info1</FieldC>
<FieldD>Field D info1</FieldD>
</ChildRoot>
</ChildRoots>
</Parent>
<Parent>
<ChildRoots>
<ChildRoot>
<FieldA>Field A info2</FieldA>
<FieldC>Field C info2</FieldC>
<FieldD>Field D info2</FieldD>
</ChildRoot>
<ChildRoot>
<FieldA>Field A info3</FieldA>
<FieldB>Field B info3</FieldB>
<FieldD>Field D info3</FieldD>
</ChildRoot>
</ChildRoots>
</Parent>
</Parents>
预期输出:
FieldA FieldB FieldC FieldD
Field A info1 Field B info1 Field C info1 Field D info1
Field A info2 Field C info2 Field D info2
Field A info3 Field B info3 Field D info3
需要的字段可以通过名称找到
xmllint --xpath '/Parents/Parent[1]/ChildRoots/ChildRoot[1]/child::*[name()="FieldA" or name()="FieldB" or name()="FieldC" or name()="FieldD"]' test.xml ; echo
# <FieldA>Field A info1</FieldA><FieldB>Field B info1</FieldB><FieldC>Field C info1</FieldC><FieldD>Field D info1</FieldD>
xmllint --xpath '/Parents/Parent[2]/ChildRoots/ChildRoot[1]/child::*[name()="FieldA" or name()="FieldB" or name()="FieldC" or name()="FieldD"]' test.xml ; echo
# <FieldA>Field A info2</FieldA><FieldC>Field C info2</FieldC><FieldD>Field D info2</FieldD>
xmllint --xpath '/Parents/Parent[2]/ChildRoots/ChildRoot[2]/child::*[name()="FieldA" or name()="FieldB" or name()="FieldC" or name()="FieldD"]' test.xml ; echo
# <FieldA>Field A info3</FieldA><FieldB>Field B info3</FieldB><FieldD>Field D info3</FieldD>
使用 count() XPATH 函数,您可以计算每一层的元素数量。例如 XPATH_INT(xml, 'count(/Parents/Parent)')
给出 2
。
使用横向视图 posexplode 为每个级别生成行并使用分解位置对 XPATH 进行参数化,因此所有这些都将具有一致的位置索引并且不会发生交叉连接。
在这个例子中<Parent>
只有一个child <ChildRoots>
,你可以去掉计数的计算,如果它可以是每个[=中唯一的<ChildRoots>
28=],这会稍微简化代码。我的代码允许每个 <Parent>
.
多个 <ChildRoots>
演示:
with mytable as (
select '<Parents>
<Parent>
<ChildRoots>
<ChildRoot>
<FieldA>Field A info1</FieldA>
<FieldB>Field B info1</FieldB>
<FieldC>Field C info1</FieldC>
<FieldD>Field D info1</FieldD>
</ChildRoot>
</ChildRoots>
</Parent>
<Parent>
<ChildRoots>
<ChildRoot>
<FieldA>Field A info2</FieldA>
<FieldC>Field C info2</FieldC>
<FieldD>Field D info2</FieldD>
</ChildRoot>
<ChildRoot>
<FieldA>Field A info3</FieldA>
<FieldB>Field B info3</FieldB>
<FieldD>Field D info3</FieldD>
</ChildRoot>
</ChildRoots>
</Parent>
</Parents>
' as xml
)
select --Get ChildRoot count and generate rows with position index, build final XPATH
--xml, parent_pos, ChildRoots_pos,
--cr.pos+1 as ChildRoot_pos, --this expression used in xpath as is:
XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldA/text()')) as FieldA,
XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldB/text()')) as FieldB,
XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldC/text()')) as FieldC,
XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldD/text()')) as FieldD
from
( --Get ChildRoots count and generate rows with position index
select xml, parent_pos, crs.pos+1 as ChildRoots_pos
--XPATH_INT(xml, concat('count(/Parents/Parent[',parent_pos,']/ChildRoots)')) as ChildRoots_cnt
from
( --Get Parent count and generate rows with position index
select xml, p.pos+1 as parent_pos
--XPATH_INT(xml, 'count(/Parents/Parent)') as Parent_cnt
from mytable lateral view outer posexplode(split(space(XPATH_INT(xml, 'count(/Parents/Parent)')-1),'')) p as pos, x
) p lateral view outer posexplode(split(space(XPATH_INT(xml, concat('count(/Parents/Parent[',parent_pos,']/ChildRoots)'))-1),'')) crs as pos, x
) crs lateral view outer posexplode(split(space(XPATH_INT(xml, concat('count(/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot)'))-1),'')) cr as pos, x
结果:
fielda fieldb fieldc fieldd
Field A info1 Field B info1 Field C info1 Field D info1
Field A info2 Field C info2 Field D info2
Field A info3 Field B info3 Field D info3
在我的输入XML中,Parents 中有多个“Parent”家庭。在每个 Parent 中,ChildRoot 中可以有一个或多个“ChildRoot”家族。在这个层级中有 4 个字段需要解析。在 Hive XPATH 函数的帮助下,如何在不发生交叉连接的情况下实现这一点?在这种情况下,没有特定字段来参数化已解析的字符串,以便映射到其右父级 (Parents/Parent/ChildRoots/ChildRoot/)。
<Parents>
<Parent>
<ChildRoots>
<ChildRoot>
<FieldA>Field A info1</FieldA>
<FieldB>Field B info1</FieldB>
<FieldC>Field C info1</FieldC>
<FieldD>Field D info1</FieldD>
</ChildRoot>
</ChildRoots>
</Parent>
<Parent>
<ChildRoots>
<ChildRoot>
<FieldA>Field A info2</FieldA>
<FieldC>Field C info2</FieldC>
<FieldD>Field D info2</FieldD>
</ChildRoot>
<ChildRoot>
<FieldA>Field A info3</FieldA>
<FieldB>Field B info3</FieldB>
<FieldD>Field D info3</FieldD>
</ChildRoot>
</ChildRoots>
</Parent>
</Parents>
预期输出:
FieldA FieldB FieldC FieldD
Field A info1 Field B info1 Field C info1 Field D info1
Field A info2 Field C info2 Field D info2
Field A info3 Field B info3 Field D info3
需要的字段可以通过名称找到
xmllint --xpath '/Parents/Parent[1]/ChildRoots/ChildRoot[1]/child::*[name()="FieldA" or name()="FieldB" or name()="FieldC" or name()="FieldD"]' test.xml ; echo
# <FieldA>Field A info1</FieldA><FieldB>Field B info1</FieldB><FieldC>Field C info1</FieldC><FieldD>Field D info1</FieldD>
xmllint --xpath '/Parents/Parent[2]/ChildRoots/ChildRoot[1]/child::*[name()="FieldA" or name()="FieldB" or name()="FieldC" or name()="FieldD"]' test.xml ; echo
# <FieldA>Field A info2</FieldA><FieldC>Field C info2</FieldC><FieldD>Field D info2</FieldD>
xmllint --xpath '/Parents/Parent[2]/ChildRoots/ChildRoot[2]/child::*[name()="FieldA" or name()="FieldB" or name()="FieldC" or name()="FieldD"]' test.xml ; echo
# <FieldA>Field A info3</FieldA><FieldB>Field B info3</FieldB><FieldD>Field D info3</FieldD>
使用 count() XPATH 函数,您可以计算每一层的元素数量。例如 XPATH_INT(xml, 'count(/Parents/Parent)')
给出 2
。
使用横向视图 posexplode 为每个级别生成行并使用分解位置对 XPATH 进行参数化,因此所有这些都将具有一致的位置索引并且不会发生交叉连接。
在这个例子中<Parent>
只有一个child <ChildRoots>
,你可以去掉计数的计算,如果它可以是每个[=中唯一的<ChildRoots>
28=],这会稍微简化代码。我的代码允许每个 <Parent>
.
<ChildRoots>
演示:
with mytable as (
select '<Parents>
<Parent>
<ChildRoots>
<ChildRoot>
<FieldA>Field A info1</FieldA>
<FieldB>Field B info1</FieldB>
<FieldC>Field C info1</FieldC>
<FieldD>Field D info1</FieldD>
</ChildRoot>
</ChildRoots>
</Parent>
<Parent>
<ChildRoots>
<ChildRoot>
<FieldA>Field A info2</FieldA>
<FieldC>Field C info2</FieldC>
<FieldD>Field D info2</FieldD>
</ChildRoot>
<ChildRoot>
<FieldA>Field A info3</FieldA>
<FieldB>Field B info3</FieldB>
<FieldD>Field D info3</FieldD>
</ChildRoot>
</ChildRoots>
</Parent>
</Parents>
' as xml
)
select --Get ChildRoot count and generate rows with position index, build final XPATH
--xml, parent_pos, ChildRoots_pos,
--cr.pos+1 as ChildRoot_pos, --this expression used in xpath as is:
XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldA/text()')) as FieldA,
XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldB/text()')) as FieldB,
XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldC/text()')) as FieldC,
XPATH_STRING(xml, concat('/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot[',cr.pos+1,']/FieldD/text()')) as FieldD
from
( --Get ChildRoots count and generate rows with position index
select xml, parent_pos, crs.pos+1 as ChildRoots_pos
--XPATH_INT(xml, concat('count(/Parents/Parent[',parent_pos,']/ChildRoots)')) as ChildRoots_cnt
from
( --Get Parent count and generate rows with position index
select xml, p.pos+1 as parent_pos
--XPATH_INT(xml, 'count(/Parents/Parent)') as Parent_cnt
from mytable lateral view outer posexplode(split(space(XPATH_INT(xml, 'count(/Parents/Parent)')-1),'')) p as pos, x
) p lateral view outer posexplode(split(space(XPATH_INT(xml, concat('count(/Parents/Parent[',parent_pos,']/ChildRoots)'))-1),'')) crs as pos, x
) crs lateral view outer posexplode(split(space(XPATH_INT(xml, concat('count(/Parents/Parent[',parent_pos,']/ChildRoots[',ChildRoots_pos,']/ChildRoot)'))-1),'')) cr as pos, x
结果:
fielda fieldb fieldc fieldd
Field A info1 Field B info1 Field C info1 Field D info1
Field A info2 Field C info2 Field D info2
Field A info3 Field B info3 Field D info3