PySpark: convert a single column into multiple columns

I have a dataframe like below:

+-----------------------------------------------------------------+
|ID|DATASET                                                       |
+-----------------------------------------------------------------+
|4A|[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"}]   |
|8B|[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"}]   |
+-----------------------------------------------------------------+

Expected output:

+--+-------+-----+-----+------+
|ID|col_1  |col_2|col_3|col_4 |
+--+-------+-----+-----+------+
|4A|"12ABC"|"141"|     |"ABCD"|
|8B|"12ABC"|"141"|     |"ABCD"|
+--+-------+-----+-----+------+

from pyspark.sql.functions import regexp_extract

df.withColumn("col_1", regexp_extract("DATASET", r"(?<=col.1:)\w+(?=(,|}|]))", 0))\
  .withColumn("col_2", regexp_extract("DATASET", r"(?<=col.2:)\w+(?=(,|}|]))", 0))

But I am getting empty values in the result:

+--+-----+-----+-----+-----+
|ID|col_1|col_2|col_3|col_4|
+--+-----+-----+-----+-----+
|4A|     |     |     |     |
|8B|     |     |     |     |
+--+-----+-----+-----+-----+
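The empty values are reproducible outside Spark: in the stored string the keys and values are quoted (`"col.1":"12ABC"`), so the lookbehind `(?<=col.1:)` never matches, and the unescaped `.` in `col.1` would match any character anyway. A minimal sketch with plain Python `re` (the same patterns can be passed to `regexp_extract`; the variable names here are illustrative):

```python
import re

# One DATASET value from the dataframe above.
row = '[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"}]'

# Original pattern: the JSON keys and values are quoted, so the literal
# sequence `col.1:` never occurs and the lookbehind fails -> no match.
old = re.search(r'(?<=col.1:)\w+', row)

# Corrected pattern: include the quotes around key and value, and escape
# the dot so `col.1` cannot also match e.g. `colX1`.
new = re.search(r'(?<="col\.1":")\w+', row)

print(old)                              # None
print(new.group(0) if new else None)    # the extracted value
```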

Any input on this?

Thanks in advance.

EDIT:

Thanks for the reply, it works well. My input has changed a bit: now I want to group by col.1 and put the values on separate rows.

Updated dataset:

+---------------------------------------------------------------------------------------------------------------------------+
|ID|DATASET                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------+
|4A|[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"},{"col.1":"13ABC","col.2":"141","col.3":"","col.4":"ABCD"}]   |
+---------------------------------------------------------------------------------------------------------------------------+

Expected result:

+--+-------+-------------------------------------------+
|ID|col_1  |col                                        |
+--+-------+-------------------------------------------+
|4A|"12ABC"|"{"col.2":"141","col.3":"","col.4":"ABCD"}"|
|4A|"13ABC"|"{"col.2":"141","col.3":"","col.4":"ABCD"}"|
+--+-------+-------------------------------------------+
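The reshaping asked for above (one output row per struct, with col.1 pulled out and the remaining fields kept as a JSON string) can be sketched in plain Python with `json`; in PySpark the same shape comes from `from_json` with an `ArrayType` schema followed by `explode`. The tuple layout is illustrative:

```python
import json

# (ID, DATASET) as in the updated dataset above.
row = ('4A',
       '[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"},'
       '{"col.1":"13ABC","col.2":"141","col.3":"","col.4":"ABCD"}]')

out = []
for item in json.loads(row[1]):      # "explode": one output row per struct
    col_1 = item.pop('col.1')        # pull out the grouping key
    out.append((row[0], col_1, json.dumps(item)))  # rest stays a JSON blob

for r in out:
    print(r)
```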

Thanks in advance.

Tried using struct fields:

from pyspark.sql.functions import from_json, explode, col
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# DATASET holds a JSON *array* of objects, so the struct must be wrapped in
# ArrayType; the field names must match the JSON keys exactly (no backticks).
schema = ArrayType(StructType([
    StructField('col.1', StringType(), True),
    StructField('col.2', StringType(), True),
    StructField('col.3', StringType(), True),
    StructField('col.4', StringType(), True),
]))

df.withColumn("DATASET", explode(from_json("DATASET", schema)))\
.select(col('ID'), col('DATASET.*'))\
.show()