使用 UDF 从 PySpark Dataframe 解析嵌套的 XML 字段
Parsing the nested XML fields from PySpark Dataframe using UDF
我有一个场景,其中数据框列中有 XML 个数据。
sex
updated_at
visitors
F
1574264158
<?xml version="1.0" encoding="utf-8
我想使用 UDF
将 - 访客列 - 嵌套的 XML 字段解析为 Dataframe 中的列
XML
的格式
<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>
您可以在不使用 UDF 的情况下使用 xpath
查询:
df = spark.createDataFrame([['<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>']], ['visitors'])
df.show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|visitors |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
df2 = df.selectExpr(
"xpath(visitors, './visitors/visitor/@id') id",
"xpath(visitors, './visitors/visitor/@age') age",
"xpath(visitors, './visitors/visitor/@sex') sex"
).selectExpr(
"explode(arrays_zip(id, age, sex)) visitors"
).select('visitors.*')
df2.show(truncate=False)
+----+---+---+
|id |age|sex|
+----+---+---+
|9615|68 |F |
|1882|34 |M |
|5987|23 |M |
+----+---+---+
如果你坚持使用UDF:
import xml.etree.ElementTree as ET
import pyspark.sql.functions as F
@F.udf('array<struct<id:string, age:string, sex:string>>')
def parse_xml(s):
root = ET.fromstring(s)
return list(map(lambda x: x.attrib, root.findall('visitor')))
df2 = df.select(
F.explode(parse_xml('visitors')).alias('visitors')
).select('visitors.*')
df2.show()
+----+---+---+
| id|age|sex|
+----+---+---+
|9615| 68| F|
|1882| 34| M|
|5987| 23| M|
+----+---+---+
我有一个场景,其中数据框列中有 XML 个数据。
sex | updated_at | visitors |
---|---|---|
F | 1574264158 | <?xml version="1.0" encoding="utf-8 |
我想使用 UDF
将 - 访客列 - 嵌套的 XML 字段解析为 Dataframe 中的列XML
的格式<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>
您可以在不使用 UDF 的情况下使用 xpath
查询:
df = spark.createDataFrame([['<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>']], ['visitors'])
df.show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|visitors |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
df2 = df.selectExpr(
"xpath(visitors, './visitors/visitor/@id') id",
"xpath(visitors, './visitors/visitor/@age') age",
"xpath(visitors, './visitors/visitor/@sex') sex"
).selectExpr(
"explode(arrays_zip(id, age, sex)) visitors"
).select('visitors.*')
df2.show(truncate=False)
+----+---+---+
|id |age|sex|
+----+---+---+
|9615|68 |F |
|1882|34 |M |
|5987|23 |M |
+----+---+---+
如果你坚持使用UDF:
import xml.etree.ElementTree as ET
import pyspark.sql.functions as F
@F.udf('array<struct<id:string, age:string, sex:string>>')
def parse_xml(s):
root = ET.fromstring(s)
return list(map(lambda x: x.attrib, root.findall('visitor')))
df2 = df.select(
F.explode(parse_xml('visitors')).alias('visitors')
).select('visitors.*')
df2.show()
+----+---+---+
| id|age|sex|
+----+---+---+
|9615| 68| F|
|1882| 34| M|
|5987| 23| M|
+----+---+---+