In Hive, how to explode child tags under identical parent tags within an XML document?
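In the Hive query below, I need to map the child tags to the parent tag they belong to in the XML content, where several parents share the same Name value. Currently a cross join occurs because the parent tag value "ABCD" is repeated here.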
with your_data as (
select '<ParentArray>
<ParentFieldArray>
<Name>ABCD</Name>
<Value>
<string>111</string>
<string></string>
</Value>
</ParentFieldArray>
<ParentFieldArray>
<Name>ABCD</Name>
<Value>
<string/>
<string>444</string>
<string>555</string>
</Value>
</ParentFieldArray>
</ParentArray>' as xmlinfo
)
select name, case when value='NULL' then '' else value end value
from (select regexp_replace(xmlinfo,'<string></string>|<string/>','<string>NULL</string>') xmlinfo
from your_data d
) d
lateral view outer explode(XPATH(xmlinfo, 'ParentArray/ParentFieldArray/Name/text()')) pf as Name
lateral view outer explode(XPATH(xmlinfo, concat('ParentArray/ParentFieldArray[Name="', pf.Name, '"]/Value/string/text()'))) vl as value;
Expected output of the query:
Name Value
ABCD 111
ABCD
ABCD
ABCD 444
ABCD 555
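To see where the extra rows come from, it can help to run the second XPATH on its own with the name filled in. The sketch below assumes the same sample XML, written on one line for brevity; the column alias matched_values is only illustrative. Because both ParentFieldArray blocks have the Name "ABCD", the name predicate should select the strings of both blocks, and the first lateral view produces one row per Name, so each of the two rows is paired with the full set of values.

```sql
-- Sketch: the name predicate alone cannot tell the two parent blocks apart,
-- because both have Name = "ABCD".
with your_data as (
  select '<ParentArray><ParentFieldArray><Name>ABCD</Name><Value><string>111</string><string></string></Value></ParentFieldArray><ParentFieldArray><Name>ABCD</Name><Value><string/><string>444</string><string>555</string></Value></ParentFieldArray></ParentArray>' as xmlinfo
)
-- matched_values is a hypothetical alias; it should hold strings from BOTH blocks
select xpath(xmlinfo, 'ParentArray/ParentFieldArray[Name="ABCD"]/Value/string/text()') as matched_values
from your_data;
```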
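In addition to the name, you can use posexplode() instead of explode() to get the position, and then filter the array by position in the second XPATH. In that case you may not need the name filter at all; debug it on a bigger dataset. I used both the name and the index filter, and it works on your data example. Positions in XPATH start at 1, while positions in Hive posexplode start at 0, which is why pos+1 is used: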
with your_data as (
select '<ParentArray>
<ParentFieldArray>
<Name>ABCD</Name>
<Value>
<string>111</string>
<string></string>
</Value>
</ParentFieldArray>
<ParentFieldArray>
<Name>ABCD</Name>
<Value>
<string/>
<string>444</string>
<string>555</string>
</Value>
</ParentFieldArray>
</ParentArray>' as xmlinfo
)
select name, pos+1 as pos, case when value='NULL' then '' else value end value
from (select regexp_replace(xmlinfo,'<string></string>|<string/>','<string>NULL</string>') xmlinfo
from your_data d
) d
lateral view outer posexplode(XPATH(xmlinfo, 'ParentArray/ParentFieldArray/Name/text()')) pf as pos, Name
lateral view outer explode(XPATH(xmlinfo, concat('((ParentArray/ParentFieldArray)[',pf.pos+1, '])[Name="', pf.Name, '"]/Value/string/text()'))) vl as value;
Result:
name pos value
ABCD 1 111
ABCD 1
ABCD 2
ABCD 2 444
ABCD 2 555
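The answer mentions that the name filter may not be needed once the position is used. A minimal sketch of that position-only variant follows, using the same sample XML on one line. It rests on an assumption that should be checked against real data: every ParentFieldArray has exactly one non-empty Name, so the index in the Name array matches the position of the block in the document. If a block had a missing or empty Name, the two would drift apart and values could be attached to the wrong parent.

```sql
-- Sketch: select each ParentFieldArray by position only (no Name predicate).
-- Assumption: every ParentFieldArray has exactly one non-empty Name, so the
-- index in the Name array equals the block's position in the document.
with your_data as (
  select '<ParentArray><ParentFieldArray><Name>ABCD</Name><Value><string>111</string><string></string></Value></ParentFieldArray><ParentFieldArray><Name>ABCD</Name><Value><string/><string>444</string><string>555</string></Value></ParentFieldArray></ParentArray>' as xmlinfo
)
select pf.name,
       pf.pos + 1 as pos,   -- posexplode is 0-based, XPATH is 1-based
       case when vl.value = 'NULL' then '' else vl.value end as value
from (
  -- give empty <string> elements a placeholder so they yield a text node
  select regexp_replace(xmlinfo, '<string></string>|<string/>', '<string>NULL</string>') as xmlinfo
  from your_data
) d
lateral view outer posexplode(xpath(xmlinfo, 'ParentArray/ParentFieldArray/Name/text()')) pf as pos, name
lateral view outer explode(
  xpath(xmlinfo, concat('ParentArray/ParentFieldArray[', cast(pf.pos + 1 as string), ']/Value/string/text()'))
) vl as value;
```

Keeping the name predicate alongside the index, as in the query above, costs little and acts as a sanity check: a mismatch between index and name would show up as missing values instead of wrong ones.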