Parsing with the scala-xml API using multiple attributes
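I have an XML document that I am trying to query with the Scala XML API. I have XPath-style queries that retrieve data from the XML tags. I want to retrieve the <price> tag value from <market>, but selecting on both the _id and type attributes. I want to write the condition with && so that I get a unique value for each price tag, for example where market _id = 1 && type = "A".
For reference, the XML is below: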
<publisher>
  <book _id="0">
    <author _id="0">Dev</author>
    <publish_date>24 Feb 1995</publish_date>
    <description>Data Structure - C</description>
    <market _id="0" type="A">
      <price>45.95</price>
    </market>
    <market _id="0" type="B">
      <price>55.95</price>
    </market>
  </book>
  <book _id="1">
    <author _id="1">Ram</author>
    <publish_date>02 Jul 1999</publish_date>
    <description>Data Structure - Java</description>
    <market _id="1" type="A">
      <price>145.95</price>
    </market>
    <market _id="1" type="B">
      <price>155.95</price>
    </market>
  </book>
</publisher>
The following code runs fine:
import scala.xml._

object XMLtoCSV extends App {
  val xmlLoad = XML.loadFile("C:/Users/sharprao/Desktop/FirstTry.xml")
  // Both filters test only the _id attribute, so every market of the book matches.
  val price = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "0")}) \ "market" filter { _ \ "@_id" exists (_.text == "0")}) \ "price").text // expected 45.95
  val price1 = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "1")}) \ "market" filter { _ \ "@_id" exists (_.text == "1")}) \ "price").text // expected 155.95
  println("price = " + price)
  println("price1 = " + price1)
}
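The output is:
price = 45.9555.95
price1 = 145.95155.95
The code above gives me both values because I could not work out how to add an && condition.
- Please tell me which Scala functions I can use besides filter.
- Also let me know how to get all the attribute names.
- If possible, tell me where I can read about all of these APIs.
Thanks in advance.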
You can write a custom predicate that checks multiple attributes:
// True only when the node carries both the given _id and the given type.
def checkMarket(marketId: String, marketType: String)(node: Node): Boolean = {
  node.attribute("_id").exists(_.text == marketId) &&
  node.attribute("type").exists(_.text == marketType)
}
Then use it as a filter:
val price1 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "0"))) \ "market" filter checkMarket("0", "A")) \ "price").text
// 45.95
val price2 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "1"))) \ "market" filter checkMarket("1", "B")) \ "price").text
// 155.95
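For anyone who wants to try the predicate without the file on disk, here is a minimal self-contained sketch. It assumes scala-xml is on the classpath and inlines a trimmed copy of the XML from the question. The names MarketFilterDemo, isMarket and priceOf are made up for illustration; \@ is the scala-xml shorthand that returns an attribute's text (or "" when the attribute is absent).

```scala
import scala.xml.{Node, XML}

object MarketFilterDemo {
  // Trimmed copy of the XML from the question, inlined so the sketch runs on its own.
  val doc = XML.loadString(
    """<publisher>
      |  <book _id="0">
      |    <market _id="0" type="A"><price>45.95</price></market>
      |    <market _id="0" type="B"><price>55.95</price></market>
      |  </book>
      |  <book _id="1">
      |    <market _id="1" type="A"><price>145.95</price></market>
      |    <market _id="1" type="B"><price>155.95</price></market>
      |  </book>
      |</publisher>""".stripMargin)

  // Both attribute checks combined with && inside a single predicate.
  def isMarket(id: String, tpe: String)(node: Node): Boolean =
    (node \@ "_id") == id && (node \@ "type") == tpe

  // Hypothetical helper: price text for one book id / market type pair.
  def priceOf(id: String, tpe: String): String = {
    val books   = (doc \ "book").filter(b => (b \@ "_id") == id)
    val markets = (books \ "market").filter(m => isMarket(id, tpe)(m))
    (markets \ "price").text
  }

  def main(args: Array[String]): Unit = {
    println(priceOf("0", "A"))
    println(priceOf("1", "B"))
  }
}
```

Splitting the chain into named steps avoids the deep parenthesis nesting of the one-liner while doing the same selection.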
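If you are interested in getting the data as a CSV-style file (tab-separated here), this is how you could write it:
(xmlLoad \ "book").flatMap { bk =>
  (bk \ "market").flatMap { mkt =>
    (mkt \ "price").map { p =>
      // One row per price: book id, market id, market type, price
      Seq(
        bk \@ "_id",
        mkt \@ "_id",
        mkt \@ "type",
        p.text.toFloat
      )
    }
  }
}.map { cols =>
  cols.mkString("\t")
}.foreach {
  println
}
It will output the following: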
0 0 A 45.95
0 0 B 55.95
1 1 A 145.95
1 1 B 155.95
Another common pattern to recognize when writing Scala: most flatMap ... flatMap ... map chains can be rewritten as for-comprehensions:
for {
  book <- xmlLoad \ "book"
  market <- book \ "market"
  price <- market \ "price"
} yield {
  val cols = Seq(
    book \@ "_id",
    market \@ "_id",
    market \@ "type",
    price.text.toFloat
  )
  println(cols.mkString("\t"))
}
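A for-comprehension can also carry the && condition from the question directly, as an if guard between generators. The sketch below assumes scala-xml on the classpath and an inlined, trimmed copy of the XML; GuardDemo and prices are illustrative names only. It picks the example from the question, market _id = 1 && type = "A".

```scala
import scala.xml.XML

object GuardDemo {
  // Trimmed copy of the XML from the question.
  val doc = XML.loadString(
    """<publisher>
      |  <book _id="0">
      |    <market _id="0" type="A"><price>45.95</price></market>
      |    <market _id="0" type="B"><price>55.95</price></market>
      |  </book>
      |  <book _id="1">
      |    <market _id="1" type="A"><price>145.95</price></market>
      |    <market _id="1" type="B"><price>155.95</price></market>
      |  </book>
      |</publisher>""".stripMargin)

  // The guard filters markets before the price generator runs;
  // both attribute checks are joined with &&.
  val prices: Seq[String] = for {
    book   <- doc \ "book"
    market <- book \ "market"
    if (market \@ "_id") == "1" && (market \@ "type") == "A"
    price  <- market \ "price"
  } yield price.text

  def main(args: Array[String]): Unit = prices.foreach(println)
}
```

The guard desugars to withFilter, so this is the same filtering idea written in comprehension syntax.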
Using Spark with a HiveContext, I was able to evaluate the XPath:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object xPathReader extends App {
  // Backslashes must be escaped inside ordinary Scala string literals.
  System.setProperty("hadoop.home.dir", "D:\\IBM\\DB\\Hadoop\\winutils") // Path for my winutils.exe
  val sparkConf = new SparkConf().setAppName("XMLParsing").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)
  val hiveContext = new HiveContext(sc)
  val myXmlPath = "D:\\IBM\\DB\\xml"
  // XmlFileUtil: the underlying class is private in Scala, hence I created a Java class to use it.
  val xmlRDDList = XmlFileUtil.withCharset(sc, myXmlPath, "UTF-8", "publisher")
  import hiveContext.implicits._
  val xmlDf = xmlRDDList.toDF("tempXMLTable")
  xmlDf.registerTempTable("tempTable")
  hiveContext.sql("select xpath_string(tempXMLTable,\"/book/@_id\") as BookId, xpath_float(tempXMLTable,\"/book/market[@_id='1' and @type='B']/price\") as Price from tempTable").show()
  /* Output
  +------+------+
  |BookId| Price|
  +------+------+
  |     0| 55.95|
  |     1|155.95|
  +------+------+
  */
}
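The XPath predicate itself can also be tried without Spark. The sketch below is a different technique from the Spark code above: it uses only the JDK's javax.xml.xpath from Scala, on an inlined, trimmed copy of the XML. XPathDemo and its members are made-up names, and the path starts at /publisher because the whole document is parsed here. Note that XPath 1.0 spells the conjunction as and rather than &&.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathFactory

object XPathDemo {
  // Trimmed copy of the XML from the question.
  val xml: String =
    """<publisher>
      |  <book _id="0">
      |    <market _id="0" type="A"><price>45.95</price></market>
      |    <market _id="0" type="B"><price>55.95</price></market>
      |  </book>
      |  <book _id="1">
      |    <market _id="1" type="A"><price>145.95</price></market>
      |    <market _id="1" type="B"><price>155.95</price></market>
      |  </book>
      |</publisher>""".stripMargin

  // Parse the string into a W3C DOM document using the JDK's built-in parser.
  private val document = DocumentBuilderFactory.newInstance()
    .newDocumentBuilder()
    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))

  private val xpath = XPathFactory.newInstance().newXPath()

  // Select the price of the market whose _id is 1 and whose type is B.
  val price: String =
    xpath.evaluate("/publisher/book/market[@_id='1' and @type='B']/price", document)

  def main(args: Array[String]): Unit = println(price)
}
```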