Read XML in Spark
I am trying to read an XML/nested XML in PySpark using the spark-xml jar.
df = sqlContext.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "hierachy") \
    .load("test.xml")
When I execute it, the DataFrame is not created correctly.
+--------------------+
| att|
+--------------------+
|[[1,Data,[Wrapped...|
+--------------------+
My XML has the following format:

heirarchy should be the rootTag and att should be the rowTag, as in:
df = spark.read \
.format("com.databricks.spark.xml") \
.option("rootTag", "hierarchy") \
.option("rowTag", "att") \
.load("test.xml")
You should get:
+-----+------+----------------------------+
|Order|attval|children |
+-----+------+----------------------------+
|1 |Data |[[[1, Studyval], [2, Site]]]|
|2 |Info |[[[1, age], [2, gender]]] |
+-----+------+----------------------------+
and the schema:
root
|-- Order: long (nullable = true)
|-- attval: string (nullable = true)
|-- children: struct (nullable = true)
| |-- att: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Order: long (nullable = true)
| | | |-- attval: string (nullable = true)
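The rootTag/rowTag split above can be illustrated without Spark using Python's stdlib xml.etree. The sample XML below is an assumption reconstructed from the output and schema shown, not the asker's actual test.xml:

```python
import xml.etree.ElementTree as ET

# Hypothetical content for test.xml, inferred from the DataFrame output above:
# <hierarchy> is the root (rootTag), each top-level <att> is one row (rowTag).
xml_text = """
<hierarchy>
  <att><Order>1</Order><attval>Data</attval>
    <children>
      <att><Order>1</Order><attval>Studyval</attval></att>
      <att><Order>2</Order><attval>Site</attval></att>
    </children>
  </att>
  <att><Order>2</Order><attval>Info</attval>
    <children>
      <att><Order>1</Order><attval>age</attval></att>
      <att><Order>2</Order><attval>gender</attval></att>
    </children>
  </att>
</hierarchy>
"""

root = ET.fromstring(xml_text)
# With rowTag="att", spark-xml would turn each direct <att> child of the
# root into one DataFrame row; findall("att") matches only direct children.
rows = [
    (int(att.findtext("Order")), att.findtext("attval"))
    for att in root.findall("att")
]
print(rows)  # [(1, 'Data'), (2, 'Info')]
```

This is why the original attempt produced a single wrapped column: with rowTag set to the root element, the whole document collapses into one row.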
Find more information about Databricks spark-xml.

Databricks has released a new version for reading XML into a Spark DataFrame:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.12</artifactId>
<version>0.6.0</version>
</dependency>
The input XML file I use in this example is available in the GitHub repository.
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "person")
.xml("persons.xml")
Schema:
root
|-- _id: long (nullable = true)
|-- dob_month: long (nullable = true)
|-- dob_year: long (nullable = true)
|-- firstname: string (nullable = true)
|-- gender: string (nullable = true)
|-- lastname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- salary: struct (nullable = true)
| |-- _VALUE: long (nullable = true)
| |-- _currency: string (nullable = true)
Output:
+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename| salary|
+---+---------+--------+---------+------+--------+----------+---------------+
| 1| 1| 1980| James| M| Smith| null| [10000, Euro]|
| 2| 6| 1990| Michael| M| null| Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+
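Note the naming convention visible in the schema above: XML attributes become columns prefixed with an underscore (e.g. _id, _currency), and the text of an element that also carries attributes lands in a struct field named _VALUE. A stdlib sketch of that mapping, assuming one person record shaped like the output suggests:

```python
import xml.etree.ElementTree as ET

# Assumed shape of one <person> record in persons.xml, inferred from the
# schema above (id is an attribute; salary has a currency attribute plus text).
person_xml = """
<person id="1">
  <firstname>James</firstname>
  <lastname>Smith</lastname>
  <salary currency="Euro">10000</salary>
</person>
"""

p = ET.fromstring(person_xml)
# spark-xml maps attributes to "_"-prefixed columns, and the text content of
# an element that also has attributes to the struct field "_VALUE".
row = {
    "_id": int(p.get("id")),
    "firstname": p.findtext("firstname"),
    "lastname": p.findtext("lastname"),
    "salary": {
        "_VALUE": int(p.find("salary").text),
        "_currency": p.find("salary").get("currency"),
    },
}
print(row["salary"])  # {'_VALUE': 10000, '_currency': 'Euro'}
```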
Note that the Spark XML API has some limitations, discussed here: Spark-XML API Limitations.
Hope this helps!
You can use the Databricks jar to parse XML into a DataFrame. You can pull in the dependency with Maven or sbt, or pass the jar directly when launching Spark:
pyspark --jars /home/sandipan/Downloads/spark_jars/spark-xml_2.11-0.6.0.jar
df = spark.read \
.format("com.databricks.spark.xml") \
.option("rootTag", "SmsRecords") \
.option("rowTag", "sms") \
.load("/home/sandipan/Downloads/mySMS/Sms/backupinfo.xml")
Schema:
>>> df.printSchema()
root
|-- address: string (nullable = true)
|-- body: string (nullable = true)
|-- date: long (nullable = true)
|-- type: long (nullable = true)
>>> df.select("address").distinct().count()
530
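The distinct-address count can be sanity-checked in plain Python. The XML below is a guess at the backup layout (sms fields as attributes, which is common for SMS backup files), not the actual backupinfo.xml:

```python
import xml.etree.ElementTree as ET

# Hypothetical backupinfo.xml: <SmsRecords> root with one <sms> element per
# message, matching rootTag="SmsRecords" / rowTag="sms" above.
xml_text = """
<SmsRecords>
  <sms address="+111" body="hi" date="1" type="1"/>
  <sms address="+222" body="yo" date="2" type="2"/>
  <sms address="+111" body="ok" date="3" type="1"/>
</SmsRecords>
"""

root = ET.fromstring(xml_text)
# Equivalent of df.select("address").distinct().count()
distinct_addresses = {sms.get("address") for sms in root.findall("sms")}
print(len(distinct_addresses))  # 2
```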
Follow this:
http://www.thehadoopguy.com/2019/09/how-to-parse-xml-data-to-saprk-dataframe.html