How to add header info to each row while parsing XML with Spark

I have an XML structure like this:
<root>
  <bookinfo>
    <time>1232314973</time>
    <requestID>233</requestID>
    <supplier>asd123</supplier>
  </bookinfo>

  <books>
    <book>
      <name>book1</name>
      <pages>124</pages>
    </book>
    <book>
      <name>book2</name>
      <pages>456</pages>
    </book>
    <book>
      <name>book4</name>
      <pages>789</pages>
    </book>
  </books>
</root>

I know I can parse the books like this:

val xml = sqlContext.read.format("com.databricks.spark.xml")
                  .option("rowTag", "book").load("FILENAME")

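But I want to add header info, such as the supplier, to every row.

Is there a way to add this "headerinfo" to all rows with Spark, without loading the file twice and storing the info in global vars/vals?

Thanks in advance!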

You can read the whole XML starting from the "root" tag and then explode the tags you need:

import org.apache.spark.sql.functions.{col, explode}

val df = hiveContext.read.format("xml").option("rowTag", "root").load("books.xml")
df.printSchema()
df.show(false)

println("-- supplier --")
val supplierDF = df.select(col("bookinfo.supplier"))
supplierDF.printSchema()
supplierDF.show(false)

println("-- books --")
val booksDF = df.select(explode(col("books.book")).alias("bookDetails"))
booksDF.printSchema()
booksDF.show(false)

println("-- bookDetails --")
val booksDetailsDF = booksDF.select(col("bookDetails.name"), col("bookDetails.pages"))
booksDetailsDF.printSchema()
booksDetailsDF.show(false)

Output:

root
 |-- bookinfo: struct (nullable = true)
 |    |-- requestID: long (nullable = true)
 |    |-- supplier: string (nullable = true)
 |    |-- time: long (nullable = true)
 |-- books: struct (nullable = true)
 |    |-- book: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- pages: long (nullable = true)

+-----------------------+-----------------------------------------------------+
|bookinfo               |books                                                |
+-----------------------+-----------------------------------------------------+
|[233,asd123,1232314973]|[WrappedArray([book1,124], [book2,456], [book4,789])]|
+-----------------------+-----------------------------------------------------+

-- supplier --
root
 |-- supplier: string (nullable = true)

+--------+
|supplier|
+--------+
|asd123  |
+--------+

-- books --
root
 |-- bookDetails: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- pages: long (nullable = true)

+-----------+
|bookDetails|
+-----------+
|[book1,124]|
|[book2,456]|
|[book4,789]|
+-----------+

-- bookDetails --
root
 |-- name: string (nullable = true)
 |-- pages: long (nullable = true)

+-----+-----+
|name |pages|
+-----+-----+
|book1|124  |
|book2|456  |
|book4|789  |
+-----+-----+
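
The steps above show the supplier and the books separately. To get the supplier onto every book row, as the question asks, one option is to select the header column and the exploded array in the same select. The following is only a sketch: it assumes the same hiveContext, spark-xml package, and books.xml file as above, and the names booksWithSupplierDF and "book" are made up for illustration.

```scala
import org.apache.spark.sql.functions.{col, explode}

// Assumption: hiveContext and the spark-xml package are available, as in the code above.
val df = hiveContext.read.format("xml").option("rowTag", "root").load("books.xml")

// A select may carry ordinary columns next to a single explode(),
// so the supplier value is repeated for each exploded book element.
val booksWithSupplierDF = df
  .select(col("bookinfo.supplier").alias("supplier"), explode(col("books.book")).alias("book"))
  .select(col("supplier"), col("book.name"), col("book.pages"))

booksWithSupplierDF.printSchema()
booksWithSupplierDF.show(false)
```

With the sample file this should give one row per book with the columns supplier, name and pages, and the file is read only once. Other header fields such as bookinfo.requestID or bookinfo.time can be added to the first select in the same way.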