pyspark to convert single col into multiple cols
I have a dataframe as below:
+---+-----------------------------------------------------------+
|ID |DATASET                                                    |
+---+-----------------------------------------------------------+
|4A |[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"}]|
|8B |[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"}]|
+---+-----------------------------------------------------------+
Expected output:
+---+-------+-----+-----+------+
|ID |col_1  |col_2|col_3|col_4 |
+---+-------+-----+-----+------+
|4A |"12ABC"|"141"|     |"ABCD"|
|8B |"12ABC"|"141"|     |"ABCD"|
+---+-------+-----+-----+------+
- I tried using regexp_extract:
df.withColumn("col_1", regexp_extract("DATASET", "(?<=col.1:)\w+(?=(,|}|] ))", 0)) \
  .withColumn("col_2", regexp_extract("DATASET", "(?<=col.2:)\w+(?=(,|}|]) )", 0))
but I am getting empty values in the result:
+---+-----+-----+-----+-----+
|ID |col_1|col_2|col_3|col_4|
+---+-----+-----+-----+-----+
|4A |     |     |     |     |
|8B |     |     |     |     |
+---+-----+-----+-----+-----+
Any input on this?
Thanks in advance.
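A likely reason the regex comes back empty: in the raw DATASET string both keys and values are quoted ("col.1":"12ABC"), so the lookbehind (?<=col.1:) never matches (there is a quote between the key and the colon), and \w+ cannot match the opening quote of the value either; the unescaped dot in col.1 is also matching any character. A minimal sketch of a pattern that accounts for the quotes, shown with Python's re module so it is self-contained (the helper name extract_value is my own; the same pattern should also work with regexp_extract, since Java regex supports fixed-width lookbehind):

```python
import re

row = '[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"}]'

def extract_value(key, text):
    # The quotes around the key and the value are part of the pattern,
    # and the dot in "col.1" is escaped so it only matches a literal dot.
    m = re.search(r'(?<="' + re.escape(key) + r'":")[^"]*', text)
    return m.group(0) if m else None

print(extract_value("col.1", row))  # 12ABC
print(extract_value("col.3", row))  # empty string
```

Note that parsing the JSON with from_json (as in the struct-field approach below in this thread) is generally more robust than regex extraction.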
EDITED:
Thanks for the reply, it worked well. My input has changed a bit: I now want to group on col.1 and put the values on separate rows.
Updated dataset:
+---+---------------------------------------------------------------------------------------------------------------------+
|ID |DATASET                                                                                                              |
+---+---------------------------------------------------------------------------------------------------------------------+
|4A |[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"},{"col.1":"13ABC","col.2":"141","col.3":"","col.4":"ABCD"}]|
+---+---------------------------------------------------------------------------------------------------------------------+
Expected result:
+---+-------+-------------------------------------------+
|ID |col_1  |col                                        |
+---+-------+-------------------------------------------+
|4A |"12ABC"|"{"col.2":"141","col.3":"","col.4":"ABCD"}"|
|4A |"13ABC"|"{"col.2":"141","col.3":"","col.4":"ABCD"}"|
+---+-------+-------------------------------------------+
Thanks in advance.
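As a sanity check on the target shape, here is the transformation expressed in plain Python with the json module (one output row per array element, with col.1 pulled out and the remaining keys re-serialized); the helper name split_rows is my own, and in PySpark the same steps would map onto from_json with an ArrayType schema, explode, and to_json:

```python
import json

# Sample matching the updated dataset above.
row_id, dataset = (
    "4A",
    '[{"col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"},'
    '{"col.1":"13ABC","col.2":"141","col.3":"","col.4":"ABCD"}]',
)

def split_rows(row_id, dataset_json):
    """One output tuple per array element: (ID, col_1, remaining JSON)."""
    out = []
    for obj in json.loads(dataset_json):
        col_1 = obj.pop("col.1")                       # pulled into its own column
        rest = json.dumps(obj, separators=(",", ":"))  # compact, like Spark's to_json
        out.append((row_id, col_1, rest))
    return out

for r in split_rows(row_id, dataset):
    print(r)
# ('4A', '12ABC', '{"col.2":"141","col.3":"","col.4":"ABCD"}')
# ('4A', '13ABC', '{"col.2":"141","col.3":"","col.4":"ABCD"}')
```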
Try using struct fields:
from pyspark.sql.functions import from_json, explode, col
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# DATASET is a JSON *array* of objects, so the struct is wrapped in an ArrayType.
# Field names must match the JSON keys exactly -- no backticks inside StructField names.
schema = ArrayType(StructType(
    [
        StructField('col.1', StringType(), True),
        StructField('col.2', StringType(), True),
        StructField('col.3', StringType(), True),
        StructField('col.4', StringType(), True),
    ]
))

# explode yields one row per array element; DATASET.* expands the struct fields.
df.withColumn("DATASET", explode(from_json("DATASET", schema)))\
    .select(col('ID'), col('DATASET.*'))\
    .show()