使用具有多个属性的 scala-xml API 进行解析

Parsing with scala-xml API with multiple attributes

我有 XML,我正在尝试使用 Scala XML API。我有 XPath 查询来从 XML 标签中检索数据。我想从 <market> 中检索 <price> 标签值,但使用 _idtype 这两个属性。我想用 && 写一个条件,这样我就会得到每个价格标签的唯一值,例如其中 MARKET _ID = 1 && TYPE = "A".

作为参考,在下面找到 XML:

<publisher>
    <book _id = "0"> 
        <author _id="0">Dev</author>
        <publish_date>24 Feb 1995</publish_date>
        <description>Data Structure - C</description>
        <market _id="0" type="A">
            <price>45.95</price>            
        </market>
        <market _id="0" type="B">
            <price>55.95</price>
        </market>
    </book>
    <book _id="1"> 
        <author _id = "1">Ram</author>
        <publish_date>02 Jul 1999</publish_date>
        <description>Data Structure - Java</description>
        <market _id="1" type="A">
            <price>145.95</price>           
        </market>   
        <market _id="1" type="B">
            <price>155.95</price>           
        </market>
    </book>
</publisher>

以下代码运行良好

import scala.xml._

object XMLtoCSV extends App {

  val xmlLoad = XML.loadFile("C:/Users/sharprao/Desktop/FirstTry.xml")  

  val price = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "0")}) \ "market" filter { _ \ "@_id" exists (_.text == "0")}) \ "price").text  //45.95
  val price1 = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "1")}) \ "market" filter { _ \ "@_id" exists (_.text == "1")}) \ "price").text  //155.95

  println("price = " + price)
  println("price1 = " + price1)
} 

输出为:

price = 45.9555.95
price1 = 145.95155.95

我上面的代码给了我两个值,因为我无法放置 && 条件。

  1. 除过滤外,请告知我可以使用哪些 SCALA 函数。
  2. 另外让我知道如何获取所有属性名称。
  3. 如果可能,请告诉我从哪里可以阅读所有这些 API。

提前致谢。

您可以编写自定义谓词来检查多个属性:

def checkMarket(marketId: String, marketType: String)(node: Node): Boolean = {
  node.attribute("_id").exists(_.text == marketId) &&
  node.attribute("type").exists(_.text == marketType)
}

然后将其用作过滤器:

val price1 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "0"))) \ "market" filter checkMarket("0", "A")) \ "price").text
// 45.95

val price2 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "1"))) \ "market" filter checkMarket("1", "B")) \ "price").text
// 155.95

如果您有兴趣获取数据的 CSV 文件,这就是编写它的方式:

(xmlload \ "book").flatMap { bk =>
  (bk \ "market").flatMap { mkt =>
    (mkt \ "price").map { p =>
      Seq(
        bk \@ "_id",
        mkt \@ "_id",
        mkt \@ "type",
        p.text.toFloat
      )
    }
  }
}.map { cols =>
  cols.mkString("\t")
}.foreach { 
  println
}

它将输出以下内容:

0       0       A       45.95
0       0       B       55.95
1       1       A       145.95
1       1       B       155.95

以及编写 Scala 时要识别的常见模式:大多数 flatMap flatMap ... map 可以重写为 for-comprehensions:

for {
    book <- xmlload \ "book"
    market <- book \ "market"
    price <- market \ "price"
} yield {
  val cols = Seq(
    book \@ "_id",
    market \@ "_id",
    market \@ "type",
    price.text.toFloat
  )
  println(cols.mkString("\t"))
}

我使用 Spark 和 hiveContext 我能够解析 xPath。

object xPathReader extends App{

    System.setProperty("hadoop.home.dir","D:\IBM\DB\Hadoop\winutils")   // Path for my winutils.exe

    val sparkConf = new SparkConf().setAppName("XMLParcing").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sc)
    val myXmlPath = "D:\IBM\DB\xml"
    val xmlRDDList = XmlFileUtil.withCharset(sc, myXmlPath, "UTF-8", "publisher") //XmlFileUtil - this is a private class in scala hence I created a Java class to use it.

    import hiveContext.implicits._

    val xmlDf = xmlRDDList.toDF("tempXMLTable")
    xmlDf.registerTempTable("tempTable")

    hiveContext.sql("select xpath_string(tempXMLTable,\"/book/@_id\") as BookId, xpath_float(tempXMLTable,\"/book/market[@_id='1' and @type='B']/price\") as Price from tempTable").show()      

    /*  Output
        +------+------+
        |BookId| Price|
        +------+------+
        |     0| 55.95|
        |     1|155.95|
        +------+------+
    */
}