Snowflake XML parsing not working for a nested structure when there is only 1 instance

We have a staging table in Snowflake named "portfolio" with a variant column named "cdc_xml" that stores XML documents loaded from S3 via Snowpipe.

The XML looks like this:

<xyz>
<jmsTimestamp>1570068080385</jmsTimestamp>
<portfolio>
<id>1234</id>
<portfolioNumber>909</portfolioNumber>
<portfolioName>Hello World</portfolioName>
<master>
  <attribute fieldName="active" value="1" oldValue="0"/>
  <attribute fieldName="name" value="Hello Co" oldValue="Hello Company"/>
  <attribute fieldName="startDate" value="04/02/1988" oldValue="04/01/1988"/>
</master>
<characteristics>
  <characteristic fieldName="currency" value="JPY" oldValue="USD"/>
  <characteristic fieldName="duplicate" value="YES" oldValue="NO"/>
  <characteristic fieldName="clone" value="TRUE" oldValue="FALSE"/>
</characteristics>
</portfolio>
</xyz>

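Below is the Snowflake lateral flatten code that is supposed to parse the XML and retrieve every "@fieldName" and "@value" at the <master><attribute> level, as well as every "@fieldName" and "@value" at the <characteristics><characteristic> level. All of this data is retrieved as name-value pairs.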

-- flatten the characteristics nested structure to get all characteristic nvps
select 'XYZ' as source_name,
       xmlget(xmlget(src1.cdc_xml, 'portfolio'), 'id'):"$"::string as source_portfolio_id,
       xmlget(xmlget(src1.cdc_xml, 'portfolio'), 'portfolioNumber'):"$"::string as portfolio_number,
       xmlget(xmlget(src1.cdc_xml, 'portfolio'), 'portfolioName'):"$"::string as name,
       get(flt1.value, '@fieldName')::string as field_name,
       nvl(decode(get(flt1.value, '@value')::string, '', null, get(flt1.value, '@value')::string), '\b') as field_value -- deletion CDC if new value is null or empty
  from staging.portfolio src1,
       lateral flatten(xmlget(xmlget(src1.cdc_xml, 'portfolio'), 'characteristics'):"$") flt1
 union
-- flatten the master nested structure to get all attribute nvps
select 'XYZ' as source_name,
       xmlget(xmlget(src2.cdc_xml, 'portfolio'), 'id'):"$"::string as source_portfolio_id,
       xmlget(xmlget(src2.cdc_xml, 'portfolio'), 'portfolioNumber'):"$"::string as portfolio_number,
       xmlget(xmlget(src2.cdc_xml, 'portfolio'), 'portfolioName'):"$"::string as name,
       get(flt2.value, '@fieldName')::string as field_name,
       nvl(decode(get(flt2.value, '@value')::string, '', null, get(flt2.value, '@value')::string), '\b') as field_value -- deletion CDC if new value is null or empty
  from staging.portfolio src2,
       lateral flatten(xmlget(xmlget(src2.cdc_xml, 'portfolio'), 'master'):"$") flt2

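It works for the sample provided above. However, if the XML looks like the first one below (only 1 instance of the nested <master><attribute> structure), the single <master><attribute> instance fails to parse, and its "@fieldName" and "@value" are both NULL (instead of "startDate" and "11/02/1988").

Similarly, if the XML looks like the one at the bottom (only 1 instance of the nested <characteristics><characteristic> structure), the single <characteristics><characteristic> instance fails to parse, and its "@fieldName" and "@value" are both NULL (instead of "clone" and "TRUE").

Any help is appreciated. Thanks in advance!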

<xyz>
<jmsTimestamp>1570068080300</jmsTimestamp>
<portfolio>
<id>9876</id>
<portfolioNumber>808</portfolioNumber>
<portfolioName>Another Example</portfolioName>
<master>
  <attribute fieldName="startDate" value="11/02/1988" oldValue="11/01/1988"/>
</master>
<characteristics>
  <characteristic fieldName="currency" value="JPY" oldValue="USD"/>
  <characteristic fieldName="duplicate" value="YES" oldValue="NO"/>
  <characteristic fieldName="clone" value="TRUE" oldValue="FALSE"/>
</characteristics>
</portfolio>
</xyz>

<xyz>
<jmsTimestamp>1570068080300</jmsTimestamp>
<portfolio>
<id>9876</id>
<portfolioNumber>808</portfolioNumber>
<portfolioName>Another Example</portfolioName>
<master>
  <attribute fieldName="active" value="0" oldValue="1"/>
  <attribute fieldName="name" value="Example Inc" oldValue="Example LLC"/>
  <attribute fieldName="startDate" value="11/02/1988" oldValue="11/01/1988"/>
</master>
<characteristics>
  <characteristic fieldName="clone" value="TRUE" oldValue="FALSE"/>
</characteristics>
</portfolio>
</xyz>

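We had the same problem with a JavaScript library that parses XML to JSON: we had to pull out the master node, check whether it was an array, and convert it to an array if it was not.

Fortunately, Snowflake appears to have IS_ARRAY among its semi-structured functions.

So if IS_ARRAY and TO_ARRAY work as expected, then this should work: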

select source_name,
    source_portfolio_id,
    portfolio_number,
    name,
    get(flt2.value, '@fieldName')::string as field_name,
    nvl(decode(get(flt2.value, '@value')::string, '', null, get(flt2.value, '@value')::string), '\b') as field_value -- deletion CDC if new value is null or empty
from (
    select 'XYZ' as source_name,
           xmlget(portfolio, 'id'):"$"::string as source_portfolio_id,
           xmlget(portfolio, 'portfolioNumber'):"$"::string as portfolio_number,
           xmlget(portfolio, 'portfolioName'):"$"::string as name,
           xmlget(portfolio, 'master'):"$" AS master_raw, -- only "portfolio" is visible from the inner query
           IFF(IS_ARRAY(master_raw), master_raw, TO_ARRAY(master_raw)) as master -- wrap a lone element in an array
    from (
        select xmlget(src2.cdc_xml, 'portfolio') as portfolio
        from staging.portfolio src2
    )
),
lateral flatten(master) flt2

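Very similar to the solution Simeon Pilgrim just provided: you can unconditionally convert each element list to an array, which keeps FLATTEN from trying to "explode" a single element into its component attributes (which is what you are experiencing). So this works as well: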

select 'XYZ' as source_name,
       xmlget(xmlget(src1.cdc_xml, 'portfolio'), 'id'):"$"::string as source_portfolio_id,
       xmlget(xmlget(src1.cdc_xml, 'portfolio'), 'portfolioNumber'):"$"::string as portfolio_number,
       xmlget(xmlget(src1.cdc_xml, 'portfolio'), 'portfolioName'):"$"::string as name,
       get(flt1.value, '@fieldName')::string as field_name,
       nvl(decode(get(flt1.value, '@value')::string, '', null, get(flt1.value, '@value')::string), '\b') as field_value -- deletion CDC if new value is null or empty
  from staging.portfolio src1,
       lateral flatten(to_array(xmlget(xmlget(src1.cdc_xml, 'portfolio'), 'characteristics'):"$")) flt1
 union
-- flatten the master nested structure to get all attribute nvps
select 'XYZ' as source_name,
       xmlget(xmlget(src2.cdc_xml, 'portfolio'), 'id'):"$"::string as source_portfolio_id,
       xmlget(xmlget(src2.cdc_xml, 'portfolio'), 'portfolioNumber'):"$"::string as portfolio_number,
       xmlget(xmlget(src2.cdc_xml, 'portfolio'), 'portfolioName'):"$"::string as name,
       get(flt2.value, '@fieldName')::string as field_name,
       nvl(decode(get(flt2.value, '@value')::string, '', null, get(flt2.value, '@value')::string), '\b') as field_value -- deletion CDC if new value is null or empty
  from staging.portfolio src2,
       lateral flatten(to_array(xmlget(xmlget(src2.cdc_xml, 'portfolio'), 'master'):"$")) flt2
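
To see the difference without touching the staging table, here is a minimal sketch that feeds two inline documents through PARSE_XML. The `samples` CTE, its `label` and `master_xml` columns, and the trimmed-down XML literals are all made up for illustration and are not part of the original pipeline. I have not run this against your account, so treat it as a sanity check rather than a verified result.

```sql
-- Hypothetical inline samples standing in for staging.portfolio.cdc_xml.
with samples as (
    select 'three attributes' as label,
           parse_xml('<master><attribute fieldName="active" value="1"/><attribute fieldName="name" value="Hello Co"/><attribute fieldName="startDate" value="04/02/1988"/></master>') as master_xml
    union all
    select 'one attribute' as label,
           parse_xml('<master><attribute fieldName="startDate" value="11/02/1988"/></master>') as master_xml
)
select s.label,
       -- is the raw element content already an array?
       is_array(s.master_xml:"$")            as raw_is_array,
       -- attribute name read from each flattened element
       get(f.value, '@fieldName')::string    as field_name
  from samples s,
       -- to_array leaves an array as-is and wraps any other value in a one-element array
       lateral flatten(to_array(s.master_xml:"$")) f;
```

If the behavior is as described above, raw_is_array should come back false only for the 'one attribute' row, while field_name should be populated on every row because FLATTEN always receives an array.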