How to add header info to every row while parsing an XML file with Spark
I have an XML structure like this:
<root>
  <bookinfo>
    <time>1232314973</time>
    <requestID>233</requestID>
    <supplier>asd123</supplier>
  </bookinfo>
  <books>
    <book>
      <name>book1</name>
      <pages>124</pages>
    </book>
    <book>
      <name>book2</name>
      <pages>456</pages>
    </book>
    <book>
      <name>book4</name>
      <pages>789</pages>
    </book>
  </books>
</root>
I know I can parse the books like this:
val xml = sqlContext.read.format("com.databricks.spark.xml")
.option("rowTag", "book").load("FILENAME")
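But I would like to add the header info, e.g. the supplier, to every row.
Is there a way to add this "headerinfo" to all rows with Spark, without loading the file twice and storing the info in global vars/vals?
Thanks in advance!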
You can read the whole XML starting from the "root" tag and then explode the tags you need:
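import org.apache.spark.sql.functions.{col, explode}

// Use "root" as the row tag so the whole document becomes a single row.
val df = hiveContext.read.format("xml").option("rowTag", "root").load("books.xml")
df.printSchema()
df.show(false)

// Header field: nested struct fields are addressed with dot notation.
println("-- supplier --")
val supplierDF = df.select(col("bookinfo.supplier"))
supplierDF.printSchema()
supplierDF.show(false)

// Turn the array of <book> elements into one row per book.
println("-- books --")
val booksDF = df.select(explode(col("books.book")).alias("bookDetails"))
booksDF.printSchema()
booksDF.show(false)

// Flatten the book struct into top-level columns.
println("-- bookDetails --")
val booksDetailsDF = booksDF.select(col("bookDetails.name"), col("bookDetails.pages"))
booksDetailsDF.printSchema()
booksDetailsDF.show(false)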
Output:
root
 |-- bookinfo: struct (nullable = true)
 |    |-- requestID: long (nullable = true)
 |    |-- supplier: string (nullable = true)
 |    |-- time: long (nullable = true)
 |-- books: struct (nullable = true)
 |    |-- book: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- pages: long (nullable = true)
+-----------------------+-----------------------------------------------------+
|bookinfo               |books                                                |
+-----------------------+-----------------------------------------------------+
|[233,asd123,1232314973]|[WrappedArray([book1,124], [book2,456], [book4,789])]|
+-----------------------+-----------------------------------------------------+
-- supplier --
root
 |-- supplier: string (nullable = true)
+--------+
|supplier|
+--------+
|asd123  |
+--------+
-- books --
root
 |-- bookDetails: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- pages: long (nullable = true)
+-----------+
|bookDetails|
+-----------+
|[book1,124]|
|[book2,456]|
|[book4,789]|
+-----------+
-- bookDetails --
root
 |-- name: string (nullable = true)
 |-- pages: long (nullable = true)
+-----+-----+
|name |pages|
+-----+-----+
|book1|124  |
|book2|456  |
|book4|789  |
+-----+-----+
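The steps above pull out the supplier and the books separately. To get the header onto every book row, as the question asks, one option is to select the header field in the same select as the explode, so each generated row carries it along and the file is read only once. Below is a minimal sketch of that idea, not a definitive implementation. The case classes, the object name HeaderPerRow and the helper withSupplier are hypothetical names made up for illustration, and the in-memory DataFrame only stands in for the DataFrame you would get from the "root" row-tag read.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical case classes mirroring the schema inferred from the XML.
case class BookInfo(requestID: Long, supplier: String, time: Long)
case class Book(name: String, pages: Long)
case class Books(book: Seq[Book])
case class Root(bookinfo: BookInfo, books: Books)

object HeaderPerRow {
  // Selecting the header field next to explode() repeats it for each book.
  def withSupplier(df: DataFrame): DataFrame =
    df.select(
        col("bookinfo.supplier").alias("supplier"),
        explode(col("books.book")).alias("book"))
      .select(col("supplier"), col("book.name"), col("book.pages"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("header-per-row")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the DataFrame read with rowTag "root" (assumption:
    // same nested shape as the schema printed by df.printSchema()).
    val df = Seq(
      Root(
        BookInfo(233L, "asd123", 1232314973L),
        Books(Seq(Book("book1", 124L), Book("book2", 456L), Book("book4", 789L))))
    ).toDF()

    // One row per book, each with the supplier, name and pages columns.
    withSupplier(df).show(false)

    spark.stop()
  }
}
```

With the real file, you would pass the DataFrame loaded with rowTag "root" to withSupplier instead of the in-memory one. Other header fields such as bookinfo.requestID can be added to the first select in the same way.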