PySpark Exploding array<map<string,string>>
I'm trying to explode a column in the format array<map<string,string>>, with the following schema:
root
|-- zipcode: string (nullable = true)
|-- employment_status: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
sdf:pyspark.sql.dataframe.DataFrame
zipcode:string
employment_status:array
element:map
key:string
value:string
+-------+---------------------------------------------------------------------------------------------------------------------------------------------+
|zipcode|employment_status |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------+
|95678 |[[key -> Data, values -> [{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]]]|
|95679 |[[key -> Data, values -> [{x=Full-time, y=0}, {x=Part-time, y=29}, {x=No Earnings, y=0}]]] |
|95680 |[[key -> Data, values -> [{x=Full-time, y=43}, {x=Part-time, y=0}, {x=No Earnings, y=71}]]] |
|95681 |[[key -> Data, values -> [{x=Full-time, y=327}, {x=Part-time, y=265}, {x=No Earnings, y=278}]]] |
|95682 |[[key -> Data, values -> [{x=Full-time, y=8534}, {x=Part-time, y=6436}, {x=No Earnings, y=8748}]]] |
+-------+---------------------------------------------------------------------------------------------------------------------------------------------+
I can explode it and get to the values without much trouble:
from pyspark.sql.functions import explode

sdf2 = sdf.select(sdf.zipcode, explode(sdf.employment_status))  # one row per array element
sdf3 = sdf2.select(sdf2.zipcode, explode(sdf2.col))             # one row per map entry (key, value)
sdf4 = sdf3.filter(sdf3.value != "Data").select(sdf3.zipcode, sdf3.value)
The result looks like this:
sdf4:pyspark.sql.dataframe.DataFrame
zipcode:string
value:string
+-------+------------------------------------------------------------------------------------------------------------------+
|zipcode|value |
+-------+------------------------------------------------------------------------------------------------------------------+
|95678 |[{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]|
|95679 |[{x=Full-time, y=0}, {x=Part-time, y=29}, {x=No Earnings, y=0}] |
|95680 |[{x=Full-time, y=43}, {x=Part-time, y=0}, {x=No Earnings, y=71}] |
|95681 |[{x=Full-time, y=327}, {x=Part-time, y=265}, {x=No Earnings, y=278}] |
|95682 |[{x=Full-time, y=8534}, {x=Part-time, y=6436}, {x=No Earnings, y=8748}] |
+-------+------------------------------------------------------------------------------------------------------------------+
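To make the two explode steps concrete, here is a plain-Python simulation (not PySpark itself) of what they do to a single row; the sample row is hypothetical and mirrors the schema above:

```python
# Hypothetical row shaped like the question's schema:
# (zipcode, array of map<string,string>)
row = ("95678", [{"key": "Data",
                  "values": "[{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]"}])

zipcode, employment_status = row

# explode(employment_status): one output row per array element (each element is a map)
exploded = [(zipcode, m) for m in employment_status]

# explode on a map column: one output row per (key, value) entry of the map
pairs = [(z, k, v) for z, m in exploded for k, v in m.items()]

# filter(value != "Data"): drop the row carrying the map key, keep the stringified values
sdf4_like = [(z, v) for z, k, v in pairs if v != "Data"]
print(sdf4_like)
```

This is only a mental model of `explode`'s semantics; in Spark the same steps run distributed over the DataFrame.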
I have a solution using F.regexp_extract and F.collect_list, but it doesn't feel right. The result should look like this:
+-------+-------------+-------------+------------+
|zipcode|full_employed|part_employed|non_employed|
+-------+-------------+-------------+------------+
| 95678| 13348| 8918| 9972|
| 95679| 0| 29| 0|
| 95680| 43| 0| 71|
| 95681|          327|          265|         278|
| 95682| 8534| 6436| 8748|
+-------+-------------+-------------+------------+
Or with "Full-time", "Part-time" and "No Earnings" as the column names; I guess that doesn't really matter.
Any ideas are greatly appreciated! Thanks!

Something like this?
from pyspark.sql import functions as F

# Each capture group grabs one y=<number> value; the greedy .* between
# groups makes groups 2 and 3 land on the last two y= occurrences.
pat = r'y=([^}]+).*y=([^}]+).*y=([^}]+)'

(sdf4
 .withColumn('y1', F.regexp_extract('value', pat, 1).cast('int'))
 .withColumn('y2', F.regexp_extract('value', pat, 2).cast('int'))
 .withColumn('y3', F.regexp_extract('value', pat, 3).cast('int'))
 .select('zipcode',
         F.expr('stack(1, y1, y2, y3)')
          .alias('full_employed', 'part_employed', 'non_employed'))
 .show()
)
# Output
# +-------+-------------+-------------+------------+
# |zipcode|full_employed|part_employed|non_employed|
# +-------+-------------+-------------+------------+
# | 95678| 13348| 8918| 9972|
# | 95679| 0| 29| 0|
# | 95680| 43| 0| 71|
# | 95681| 327| 265| 278|
# | 95682| 8534| 6436| 8748|
# +-------+-------------+-------------+------------+
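As a quick sanity check of that pattern, the same regex can be run against one sample `value` string with Python's `re` module (PySpark's `regexp_extract` uses Java regex, but this pattern behaves identically for this input):

```python
import re

# Sample value string matching the rows shown in the question
value = "[{x=Full-time, y=13348}, {x=Part-time, y=8918}, {x=No Earnings, y=9972}]"

pat = r'y=([^}]+).*y=([^}]+).*y=([^}]+)'
m = re.search(pat, value)

# Greedy .* backtracks so the three groups capture the three y= values in order
y1, y2, y3 = (int(m.group(i)) for i in (1, 2, 3))
print(y1, y2, y3)  # → 13348 8918 9972
```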