PySpark Exploding array<map<string,string>>
I'm trying to explode a column in the format array<map<string,string>>, with the following schema:
root
|-- zipcode: string (nullable = true)
|-- employment_status: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
sdf:pyspark.sql.dataframe.DataFrame
zipcode:string
employment_status:array
element:map
key:string
value:string
+-------+---------------------------------------------------------------------------------------------------------------------------------------------+
|zipcode|employment_status |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------+
|95678 |[[key -> Data, values -> [{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]]]|
|95679 |[[key -> Data, values -> [{x=Full-time, y=0}, {x=Part-time, y=29}, {x=No Earnings, y=0}]]] |
|95680 |[[key -> Data, values -> [{x=Full-time, y=43}, {x=Part-time, y=0}, {x=No Earnings, y=71}]]] |
|95681 |[[key -> Data, values -> [{x=Full-time, y=327}, {x=Part-time, y=265}, {x=No Earnings, y=278}]]] |
|95682 |[[key -> Data, values -> [{x=Full-time, y=8534}, {x=Part-time, y=6436}, {x=No Earnings, y=8748}]]] |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------+
I can explode it and get to the values without much trouble:
from pyspark.sql.functions import explode

sdf2 = sdf.select(sdf.zipcode, explode(sdf.employment_status))  # one row per array element
sdf3 = sdf2.select(sdf2.zipcode, explode(sdf2.col))             # one row per map entry (key, value)
sdf4 = sdf3.filter(sdf3.value != "Data").select(sdf3.zipcode, sdf3.value)
The result looks like this:
sdf4:pyspark.sql.dataframe.DataFrame
zipcode:string
value:string
+-------+------------------------------------------------------------------------------------------------------------------+
|zipcode|value |
+-------+------------------------------------------------------------------------------------------------------------------+
|95678 |[{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]|
|95679 |[{x=Full-time, y=0}, {x=Part-time, y=29}, {x=No Earnings, y=0}] |
|95680 |[{x=Full-time, y=43}, {x=Part-time, y=0}, {x=No Earnings, y=71}] |
|95681 |[{x=Full-time, y=327}, {x=Part-time, y=265}, {x=No Earnings, y=278}] |
|95682 |[{x=Full-time, y=8534}, {x=Part-time, y=6436}, {x=No Earnings, y=8748}] |
+-------+------------------------------------------------------------------------------------------------------------------+
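To make the two explode steps concrete, here is a plain-Python simulation (not PySpark itself) of what they do to a single row; the sample row is hypothetical and mirrors the schema above:

```python
# Hypothetical row shaped like the question's schema:
# (zipcode, array of map<string,string>)
row = ("95678", [{"key": "Data",
                  "values": "[{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]"}])

zipcode, employment_status = row

# explode(employment_status): one output row per array element (each element is a map)
exploded = [(zipcode, m) for m in employment_status]

# explode on a map column: one output row per (key, value) entry of the map
pairs = [(z, k, v) for z, m in exploded for k, v in m.items()]

# filter(value != "Data"): drop the row carrying the map key, keep the stringified values
sdf4_like = [(z, v) for z, k, v in pairs if v != "Data"]
print(sdf4_like)
```

This is only a mental model of `explode`'s semantics; in Spark the same steps run distributed over the DataFrame.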
I have a solution using F.regexp_extract and F.collect_list, but it doesn't feel right. The result should look like this:
+-------+-------------+-------------+------------+
|zipcode|full_employed|part_employed|non_employed|
+-------+-------------+-------------+------------+
| 95678| 13348| 8918| 9972|
| 95679| 0| 29| 0|
| 95680| 43| 0| 71|
| 95681|          327|          265|         278|
| 95682| 8534| 6436| 8748|
+-------+-------------+-------------+------------+
Or with "Full-time", "Part-time" and "No Earnings" as the column names; I guess that doesn't really matter.
Any ideas are greatly appreciated! Thanks!

Something like this?
from pyspark.sql import functions as F

# Each capture group grabs one y=<number> value; the greedy .* between
# groups makes groups 2 and 3 land on the last two y= occurrences.
pat = r'y=([^}]+).*y=([^}]+).*y=([^}]+)'

(sdf4
 .withColumn('y1', F.regexp_extract('value', pat, 1).cast('int'))
 .withColumn('y2', F.regexp_extract('value', pat, 2).cast('int'))
 .withColumn('y3', F.regexp_extract('value', pat, 3).cast('int'))
 .select('zipcode',
         F.expr('stack(1, y1, y2, y3)')
          .alias('full_employed', 'part_employed', 'non_employed'))
 .show()
)
# Output
# +-------+-------------+-------------+------------+
# |zipcode|full_employed|part_employed|non_employed|
# +-------+-------------+-------------+------------+
# | 95678| 13348| 8918| 9972|
# | 95679| 0| 29| 0|
# | 95680| 43| 0| 71|
# | 95681| 327| 265| 278|
# | 95682| 8534| 6436| 8748|
# +-------+-------------+-------------+------------+
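As a quick sanity check of that pattern, the same regex can be run against one sample `value` string with Python's `re` module (PySpark's `regexp_extract` uses Java regex, but this pattern behaves identically for this input):

```python
import re

# Sample value string matching the rows shown in the question
value = "[{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]"

pat = r'y=([^}]+).*y=([^}]+).*y=([^}]+)'
m = re.search(pat, value)

# Greedy .* backtracks so the three groups capture the three y= values in order
y1, y2, y3 = (int(m.group(i)) for i in (1, 2, 3))
print(y1, y2, y3)  # → 13348 8918 9972
```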