Spark: How to transform to Data Frame data from multiple nested XML files with attributes
How to transform the following values from multiple XML files into a Spark data frame:
- the attribute Id0 from Level_0
- Date / Value from Level_4
Required output:
+----------------+-------------+---------+
|Id0             |Date         |Value    |
+----------------+-------------+---------+
|Id0_value_file_1| 2021-01-01  | 4_1     |
|Id0_value_file_1| 2021-01-02  | 4_2     |
|Id0_value_file_2| 2021-01-01  | 4_1     |
|Id0_value_file_2| 2021-01-02  | 4_2     |
+----------------+-------------+---------+
file_1.xml:
<Level_0 Id0="Id0_value_file1">
<Level_1 Id1_1 ="Id3_value" Id_2="Id2_value">
<Level_2_A>A</Level_2_A>
<Level_2>
<Level_3>
<Level_4>
<Date>2021-01-01</Date>
<Value>4_1</Value>
</Level_4>
<Level_4>
<Date>2021-01-02</Date>
<Value>4_2</Value>
</Level_4>
</Level_3>
</Level_2>
</Level_1>
</Level_0>
file_2.xml:
<Level_0 Id0="Id0_value_file2">
<Level_1 Id1_1 ="Id3_value" Id_2="Id2_value">
<Level_2_A>A</Level_2_A>
<Level_2>
<Level_3>
<Level_4>
<Date>2021-01-01</Date>
<Value>4_1</Value>
</Level_4>
<Level_4>
<Date>2021-01-02</Date>
<Value>4_2</Value>
</Level_4>
</Level_3>
</Level_2>
</Level_1>
</Level_0>
Current code example:
files_list = ["file_1.xml", "file_2.xml"]
df = (spark.read.format('xml')
      .options(rowTag="Level_4")
      .load(','.join(files_list)))
Current output (the Id0 attribute column is missing, because with rowTag="Level_4" the Level_0 attribute is outside the parsed rows):
+-------------+---------+
|Date         |Value    |
+-------------+---------+
| 2021-01-01  | 4_1     |
| 2021-01-02  | 4_2     |
| 2021-01-01  | 4_1     |
| 2021-01-02  | 4_2     |
+-------------+---------+
There are some examples, but none of them solve the problem:
- I'm using Databricks spark-xml - https://github.com/databricks/spark-xml
- There is an example, but it does not read attributes.
EDIT:
As @mck correctly pointed out, <Level_2>A</Level_2> is not valid XML. My example contained a mistake (the xml files above are now corrected); it should be <Level_2_A>A</Level_2_A>. After that fix, the suggested solution works even for multiple files.
NOTE: To speed up loading a large number of xml files, define the schema; if no schema is defined, Spark reads every file while creating the data frame in order to infer the schema...
For more information: https://szczeles.github.io/Reading-JSON-CSV-and-XML-files-efficiently-in-Apache-Spark/
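For reference, a minimal sketch of such an explicit schema, matching the inferred schema shown in Step 1 below (keeping all leaf fields as strings is an assumption for illustration, not part of the original post):

from pyspark.sql.types import StructType, StructField, ArrayType, StringType

# Hand-written schema for rowTag="Level_0", matching the printSchema() output below,
# so Spark does not have to scan every file just to infer it.
level_4 = StructType([
    StructField("Date", StringType(), True),
    StructField("Value", StringType(), True),
])
schema = StructType([
    StructField("Level_1", StructType([
        StructField("Level_2", StructType([
            StructField("Level_3", StructType([
                StructField("Level_4", ArrayType(level_4), True),
            ]), True),
        ]), True),
        StructField("Level_2_A", StringType(), True),
        StructField("_Id1_1", StringType(), True),
        StructField("_Id_2", StringType(), True),
    ]), True),
    StructField("_Id0", StringType(), True),
])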
Step 1):
files_list = ["file_1.xml", "file_2.xml"]
# for the schema, see the NOTE above
df = (spark.read.format('xml')
      .options(rowTag="Level_0")
      .load(','.join(files_list), schema=schema))
df.printSchema()
root
|-- Level_1: struct (nullable = true)
| |-- Level_2: struct (nullable = true)
| | |-- Level_3: struct (nullable = true)
| | | |-- Level_4: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Date: string (nullable = true)
| | | | | |-- Value: string (nullable = true)
| |-- Level_2_A: string (nullable = true)
| |-- _Id1_1: string (nullable = true)
| |-- _Id_2: string (nullable = true)
|-- _Id0: string (nullable = true)
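The underscore-prefixed columns (_Id0, _Id1_1, _Id_2) are the XML attributes: spark-xml prefixes attribute columns with "_" by default. A sketch of changing that prefix via the attributePrefix reader option (shown for illustration only; any explicit schema would then have to use the same prefix):

# Attributes appear as _Id0 etc. because of the default attributePrefix "_".
df_attr = (spark.read.format('xml')
           .options(rowTag="Level_0", attributePrefix="attr_")
           .load(','.join(files_list)))
df_attr.printSchema()  # attributes now appear as attr_Id0, attr_Id1_1, attr_Id_2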
Step 2) See @mck's solution below:
You can use Level_0 as the rowTag and explode the relevant arrays/structs:
import pyspark.sql.functions as F
df = spark.read.format('xml').options(rowTag="Level_0").load('line_removed.xml')
df2 = df.select(
'_Id0',
F.explode_outer('Level_1.Level_2.Level_3.Level_4').alias('Level_4')
).select(
'_Id0',
'Level_4.*'
)
df2.show()
+---------------+----------+-----+
| _Id0| Date|Value|
+---------------+----------+-----+
|Id0_value_file1|2021-01-01| 4_1|
|Id0_value_file1|2021-01-02| 4_2|
+---------------+----------+-----+
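Putting the two steps together for multiple files (a sketch, assuming the explicit schema from the NOTE above and the file names from the question):

import pyspark.sql.functions as F

files_list = ["file_1.xml", "file_2.xml"]

# Read one row per Level_0 element (i.e. one per file) with the predefined schema,
# then explode the nested Level_4 array down to one row per Date/Value pair.
df = (spark.read.format('xml')
      .options(rowTag="Level_0")
      .load(','.join(files_list), schema=schema))

df2 = (df
       .select('_Id0', F.explode_outer('Level_1.Level_2.Level_3.Level_4').alias('Level_4'))
       .select(F.col('_Id0').alias('Id0'), 'Level_4.*'))

df2.show(truncate=False)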