How to use lateral view explode in Hive for XML data format?

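I am trying to load sales data in XML format into a Hive table. A small sample of the data is shown below.

I know I can load this data into Hive if I split it across several tables and then join them as needed. I would like to know whether I can instead load it into a single table, with output that looks like the attached screenshot.

Please help me understand the table structure I should use, and how to use lateral view explode effectively to achieve this.

Sample data: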

  <Store>
    <Version>1.1</Version>
    <StoreId>16695</StoreId>    
    <Bskt>
      <TillNo>4</TillNo>
      <BsktNo>1753</BsktNo>
      <DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
      <OpID>50056</OpID>
      <Itm>
        <ItmSeq>1</ItmSeq>
        <GTIN>29559</GTIN>
        <ItmDsc>CHOCALATE</ItmDsc>
        <ItmProm>
          <PromCD>CM</PromCD>
        </ItmProm>
      </Itm>
      <Itm>
        <ItmSeq>2</ItmSeq>
        <GTIN>59653</GTIN>
        <ItmDsc>CORN FLAKES</ItmDsc>
      </Itm>
      <Itm>
        <ItmSeq>3</ItmSeq>
        <GTIN>42260</GTIN>
        <ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
        <ItmProm>
          <PromCD>MTSRO</PromCD>
          <OfferID>11766</OfferID>
        </ItmProm>
      </Itm>
    </Bskt>
    <Bskt>
      <TillNo>5</TillNo>
      <BsktNo>1947</BsktNo>
      <DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
      <OpID>50063</OpID>
      <Itm>
        <ItmSeq>1</ItmSeq>
        <GTIN>24064</GTIN>
        <ItmDsc>TOMATOES 2KG</ItmDsc>
        <ItmProm>
          <PromCD>INSTORE</PromCD>
        </ItmProm>
      </Itm>
      <Itm>
        <ItmSeq>2</ItmSeq>
        <GTIN>81287</GTIN>
        <ItmDsc>ROTHMANS BLUE</ItmDsc>
        <ItmProm>
          <PromCD>TF</PromCD>
        </ItmProm>
      </Itm>
    </Bskt>
  </Store>  

Expected output:

[image: screenshot of the expected output]

Table structure:

CREATE EXTERNAL TABLE IF NOT EXISTS POC_BASKET_ITEM_PROMO (
`Version` string,
`StoreId` string,
`DateTime` array<string>,
`BsktNo` array<double>,
`TillNo` array<int>,
`Item_Seq_num` array<int>,
`GTIN` array<string>,
`ItmDsc` array<string>,
`Promo_CD` array<string>,
`Offer_ID` array<int>
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.Version"="/Store/Version/text()",
"column.xpath.StoreId"="/Store/StoreId/text()",
"column.xpath.DateTime"="/Store/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Store/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Store/Bskt/TillNo/text()",
"column.xpath.Item_Seq_num"="/Store/Bskt/Itm/ItmSeq/text()",
"column.xpath.GTIN"="/Store/Bskt/Itm/GTIN/text()",
"column.xpath.ItmDsc"="/Store/Bskt/Itm/ItmDsc/text()",
"column.xpath.Promo_CD"="/Store/Bskt/Itm/ItmProm/PromCD/text()",
"column.xpath.Offer_ID"="/Store/Bskt/Itm/ItmProm/OfferID/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
    LOCATION 'hdfs://namenode:8020/DEV/TEST/nanda_test'
    TBLPROPERTIES (
    "xmlinput.start"="<Store","xmlinput.end"="</Store>"
);

Output:

[image: screenshot of the table output]

I tried the following query to read the data, but it does not return the results the way I want.

select Version,StoreId,basket_dtm,basket_number,till_number from POC_BASKET_ITEM_PROMO
    LATERAL VIEW explode(DateTime) table1 as basket_dtm 
    LATERAL VIEW explode(BsktNo) table2 as basket_number
    LATERAL VIEW explode(TillNo) table3 as till_number;

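Result:

[image: screenshot of the query result]

Exploding array columns behaves like a cross join. If you have 3 columns, each holding an array of 2 elements, applying explode to all three columns gives you 2 x 2 x 2 = 8 rows.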
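
The row count can be checked without any table. The following is a minimal sketch that assumes a Hive version allowing SELECT without a FROM clause in a subquery (0.13 or later); the aliases `t`, `ea`, `eb`, `ec` and the literal arrays are made up for illustration.

```sql
-- Three independent explodes over 2-element arrays: every combination is produced.
SELECT a_val, b_val, c_val
FROM (SELECT array(1, 2)     AS a,
             array('x', 'y') AS b,
             array(10, 20)   AS c) t
LATERAL VIEW explode(a) ea AS a_val
LATERAL VIEW explode(b) eb AS b_val
LATERAL VIEW explode(c) ec AS c_val;
```

Each lateral view multiplies the rows produced so far by the length of its array, which is why the three basket-level explodes in the question repeat every DateTime for every BsktNo and TillNo.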

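With explode alone, you cannot map an element of one array to the corresponding element of another.

You can, however, use posexplode, which also returns the index of each element. You can then use that index as a join condition. This gets tricky when you have many columns and the array sizes differ between columns.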
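
As a sketch of the posexplode approach, the query below pairs up the basket-level arrays of the POC_BASKET_ITEM_PROMO table from the question by position. It assumes that DateTime, BsktNo and TillNo each contain exactly one element per `<Bskt>`, in document order; the position aliases (`dtm_pos`, `bskt_pos`, `till_pos`) are invented for this example.

```sql
-- posexplode emits (position, value); keeping only rows whose positions match
-- turns the cross join into an element-by-element pairing.
SELECT Version,
       StoreId,
       basket_dtm,
       basket_number,
       till_number
FROM POC_BASKET_ITEM_PROMO
LATERAL VIEW posexplode(`DateTime`) t1 AS dtm_pos,  basket_dtm
LATERAL VIEW posexplode(BsktNo)     t2 AS bskt_pos, basket_number
LATERAL VIEW posexplode(TillNo)     t3 AS till_pos, till_number
WHERE dtm_pos = bskt_pos
  AND dtm_pos = till_pos;
```

This only lines up arrays of equal length. The item-level arrays (GTIN, ItmDsc, Promo_CD, Offer_ID) are flattened across all baskets and some items have no promotion, so their positions no longer correspond to a basket or to each other.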

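Solution

  • If you only have a few columns to explode and the arrays are all the same size, use posexplode. In your case, this will not work.
  • Store the XML as a complex data type: store the whole XML as a complex data type (not just arrays), that is, create a struct that mirrors your XML. This is feasible if your XML is not too complex. However, XmlSerDe is not as good as JsonSerDe at converting files into complex data types.

So, in your case, the best solution is:

  • Convert your XML to JSON. You can use NiFi or some other technology for this.
  • Create a Hive table with JsonSerDe and load the file into it.
  • Create a view according to your requirements.

JSON for your XML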

{"Version":"1.1","StoreId":"16695","Bskt":[{"TillNo":"4","BsktNo":"1753","DateTime":"2017-10-31T11:19:34.000+11:00","OpID":"50056","Itm":[{"ItmSeq":"1","GTIN":"29559","ItmDsc":"CHOCALATE","ItmProm":{"PromCD":"CM"}},{"ItmSeq":"2","GTIN":"59653","ItmDsc":"CORNFLAKES"},{"ItmSeq":"3","GTIN":"42260","ItmDsc":"MILKCHOCOLATE162GM","ItmProm":{"PromCD":"MTSRO","OfferID":"11766"}}]},{"TillNo":"5","BsktNo":"1947","DateTime":"2017-10-31T16:24:59.000+11:00","OpID":"50063","Itm":[{"ItmSeq":"1","GTIN":"24064","ItmDsc":"TOMATOES2KG","ItmProm":{"PromCD":"INSTORE"}},{"ItmSeq":"2","GTIN":"81287","ItmDsc":"ROTHMANSBLUE","ItmProm":{"PromCD":"TF"}}]}]}

JsonSerDe may throw errors if the file contains tabs or other whitespace, so it is best to remove them.

Hive table

create external table temp.test_json
(
Version string,
StoreId string,
Bskt array<struct<
                    BsktNo:string,
                    DateTime:string,
                    OpID:string,
                    TillNo:string,
                    Itm:array<struct<
                                        GTIN:string,
                                        ItmDsc:string,
                                        ItmSeq:string,
                                        ItmProm:struct<
                                                        OfferID:string,
                                                        PromCD:string
                                                        >

                                    >
                            >
                >
            >
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location '/tmp/test_json/table/';

Create view

SELECT Version,
         StoreId,
         basket.bsktno,
         basket.tillno,
         basket.`datetime`,
         item.itmseq,
         item.itmdsc,
         item.gtin,
         item.itmprom.offerid,
         item.itmprom.promcd
FROM temp.test_json 
lateral view explode(bskt) b AS basket 
lateral view explode(basket.itm) i AS item
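
The SELECT above can be wrapped in a view definition. The following is a sketch only: the view name `temp.v_basket_item` and the column aliases are hypothetical, and `LATERAL VIEW OUTER` is an optional variation that keeps a row (with NULLs for the exploded columns) when a store has no baskets or a basket has no items.

```sql
-- Hypothetical view name and aliases; adjust to your own naming conventions.
CREATE VIEW IF NOT EXISTS temp.v_basket_item AS
SELECT Version,
       StoreId,
       basket.bsktno        AS basket_number,
       basket.tillno        AS till_number,
       basket.`datetime`    AS basket_dtm,
       item.itmseq          AS item_seq,
       item.itmdsc          AS item_desc,
       item.gtin            AS gtin,
       item.itmprom.offerid AS offer_id,
       item.itmprom.promcd  AS promo_cd
FROM temp.test_json
LATERAL VIEW OUTER explode(bskt) b AS basket
LATERAL VIEW OUTER explode(basket.itm) i AS item;
```

Because the second explode takes its input from the struct produced by the first, each item stays attached to its own basket, and items without an ItmProm element simply show NULL in the two promotion columns.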

Thank you for the detailed solution. I tested it and it works well. I then tried a similar approach using the XML SerDe directly.

Reading data from XML

My challenges:

1) XML-to-JSON conversion takes additional development effort, and Cloudera does not ship Apache NiFi parcels by default, so we would need to install it with custom parcels.
2) My data will definitely contain spaces and tabs, especially in the 'Item description' field. We need to load the data with the same names as we receive it. So converting to JSON and using 'org.openx.data.jsonserde.JsonSerDe' didn't help. Queries failed with errors, as you suggested they might.

Below are the Hive table structure and the queries I used to read the data. I was able to explode the first-level array (Bskt) without any problems.

But when I try to explode the second-level array (Itm), it returns NULL for all fields in 'Itm'.

Is there something wrong with my query or with the table structure itself?

create external table nanda_scan_xml  (
  Version string,
  StoreId string,
  Bskt array<struct<
                    Bskt:struct<
                                DateTime:string,
                                TillNo:string,
                                BsktNo:string,
                                Itm:array<struct<
                                                Itm:struct<
                                                    ItmSeq:string,      
                                                    GTIN:string,        
                                                    ItmDsc:string,      
                                                    DeptCD:string,      
                                                    ItmCD:string,       
                                                    SalesQTY:string,        
                                                    SalesExGST:string,      
                                                    Points:string,      
                                                    CostExGST:string,       
                                                    GSTRate:string,     
                                                    DiscAmtExGST:string,        
                                                    ItmProm:struct<     
                                                                    PromCD:string,      
                                                                    OfferID:string      
                                                                  >
                                                              >
                                                     >
                                            >
                                >
                    >
            >
)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties 
(
    "column.xpath.Version"       = "/Store/Version/text()",
    "column.xpath.StoreId"       = "/Store/StoreId/text()",
    "column.xpath.Bskt"  = "/Store/Bskt"

)
stored as 
inputformat     'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
LOCATION 'hdfs://namenode/LandingArea/Sources/SCANP/IGA_SCAN/STAGING/'
tblproperties 
(
    "xmlinput.start"    = "<Store>",
    "xmlinput.end"      = "</Store>"
);

Queries:

1) For Bskt, which works fine:

SELECT  Version,
        StoreId,
        basket.Bskt.DateTime,
        basket.Bskt.bsktno,
        basket.Bskt.tillno
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket;

Result:

[image: screenshot of the query result]

2) When trying two lateral view explodes in a single query:

SELECT  Version,
        StoreId,
        basket.Bskt.DateTime,
        basket.Bskt.bsktno,
        basket.Bskt.tillno,
        item.Itm.ItmSeq,
        item.Itm.ItmDsc,
        item.Itm.GTIN,
        item.Itm.itmprom.OfferID,
        item.Itm.itmprom.PromCD 
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket
LATERAL VIEW EXPLODE(basket.Bskt.Itm) i AS item limit 100;

Result:

[image: screenshot of the query result]

3) Query:

SELECT  Version,
        StoreId,
        basket.Bskt.DateTime,
        basket.Bskt.bsktno,
        basket.Bskt.tillno,
        item.Itm.ItmSeq,
        item.Itm.ItmDsc,
        item.Itm.GTIN,
        item.Itm.itmprom.OfferID,
        item.Itm.itmprom.PromCD 
FROM eim_stg.nanda_scan_xml
LATERAL VIEW EXPLODE(Bskt) b AS basket
LATERAL VIEW EXPLODE(basket.Itm) i AS item limit 100;

Error:

[image: screenshot of the error message]