How to split a list of objects into separate columns in a PySpark DataFrame
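I have a column in a DataFrame that holds a list of objects (an array of structs), like
column: [{key1:value1}, {key2:value2}, {key3:value3}]
I want to split this column into separate columns, with the key names as column names and the values as the column values in the same row.
The final result should look like
key1:value1, key2:value2, key3:value3
How can I achieve this in PySpark?
For example, here is sample data to create the DataFrame: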
from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType
)

# `spark` is assumed to be an existing SparkSession
my_new_schema = StructType([
    StructField('id', LongType()),
    StructField('countries', ArrayType(StructType([
        StructField('name', StringType()),
        StructField('capital', StringType())
    ])))
])
l = [(1, [
    {'name': 'Italy', 'capital': 'Rome'},
    {'name': 'Spain', 'capital': 'Madrid'}
])]
dz = spark.createDataFrame(l, schema=my_new_schema)
# we have an array of structs:
dz.show(truncate=False)
+---+--------------------------------+
|id |countries |
+---+--------------------------------+
|1 |[{Italy, Rome}, {Spain, Madrid}]|
+---+--------------------------------+
Expected output:
+---+-----+------+
|id |Italy|Spain |
+---+-----+------+
|1  |Rome |Madrid|
+---+-----+------+
You can use inline to expand the countries array into one row per struct, then pivot on the name column:
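import pyspark.sql.functions as F

dz1 = (
    dz.selectExpr("id", "inline(countries)")  # one row per struct: id, name, capital
    .groupBy("id")
    .pivot("name")                            # each distinct name becomes a column
    .agg(F.first("capital"))                  # the capital fills the pivoted cell
)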
dz1.show()
#+---+-----+------+
#|id |Italy|Spain |
#+---+-----+------+
#|1 |Rome |Madrid|
#+---+-----+------+
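
To make the two steps easier to follow, here is a rough plain-Python sketch of the same reshaping with no Spark session involved. The helper names `inline` and `pivot_first` are hypothetical stand-ins written for this illustration, not PySpark APIs. They only mimic what `inline(countries)` and `groupBy("id").pivot("name").agg(F.first("capital"))` do to the sample data, assuming every struct has a non-null name.

```python
from collections import defaultdict

# Plain-Python stand-in for the sample DataFrame rows (illustration only).
rows = [
    {"id": 1, "countries": [
        {"name": "Italy", "capital": "Rome"},
        {"name": "Spain", "capital": "Madrid"},
    ]},
]


def inline(rows, array_col):
    """Mimic inline(): emit one flat row per struct in the array column."""
    for row in rows:
        rest = {k: v for k, v in row.items() if k != array_col}
        for struct in row[array_col] or []:
            yield {**rest, **struct}


def pivot_first(rows, group_col, pivot_col, value_col):
    """Mimic groupBy(group_col).pivot(pivot_col).agg(first(value_col))."""
    grouped = defaultdict(dict)
    for row in rows:
        # setdefault keeps the first value seen for each (group, pivot) pair.
        grouped[row[group_col]].setdefault(row[pivot_col], row[value_col])
    return [{group_col: key, **cols} for key, cols in grouped.items()]


exploded = list(inline(rows, "countries"))
result = pivot_first(exploded, "id", "name", "capital")
print(result)  # [{'id': 1, 'Italy': 'Rome', 'Spain': 'Madrid'}]
```

The intermediate `exploded` list corresponds to the rows Spark produces after `inline`: one `(id, name, capital)` row per country. The pivot step then folds those rows back into a single row per `id`.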