Pyspark: Is there a function to split dataframe column values on the basis of comma
Input
+--------------+-----------------------+-----------------------+
|ID |Subject |Marks |
+--------------+-----------------------+-----------------------+
|1 |maths,physics |80,90 |
|2 |Computer |73 |
|3 |music,sports,chemistry |76,89,85 |
+--------------+-----------------------+-----------------------+
Expected output
+--------------+-----------------------+-----------------------+
|ID |Subject |Marks |
+--------------+-----------------------+-----------------------+
|1 |maths |80 |
|1 |physics |90 |
|2 |Computer |73 |
|3 |music |76 |
|3 |sports |89 |
|3 |chemistry |85 |
+--------------+-----------------------+-----------------------+
I need help getting the expected output. I have already tried the explode function, but it only works on a single column.
You can explode multiple columns at once (note this answer uses pandas, where DataFrame.explode accepts a list of columns since pandas 1.3):
import pandas as pd

cols = ['Subject', 'Marks']
# Split each comma-separated string into a list
df[cols] = df[cols].apply(lambda x: x.str.split(','))
# Explode both columns in lockstep (pandas >= 1.3)
df.explode(cols)
Output:
ID Subject Marks
0 1 maths 80
0 1 physics 90
1 2 Computer 73
2 3 music 76
2 3 sports 89
2 3 chemistry 85
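For reference, the pandas snippet above can be run end to end as a minimal, self-contained sketch; the DataFrame below is constructed here to match the input table:

```python
import pandas as pd

# Build a DataFrame matching the input table
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Subject': ['maths,physics', 'Computer', 'music,sports,chemistry'],
    'Marks': ['80,90', '73', '76,89,85'],
})

cols = ['Subject', 'Marks']
# Split each comma-separated string into a list of values
df[cols] = df[cols].apply(lambda x: x.str.split(','))
# Explode both columns in lockstep (requires pandas >= 1.3)
result = df.explode(cols)
print(result)
```

The original row index is repeated for each exploded row (0, 0, 1, 2, 2, 2), which is why the output above shows duplicate index values on the left.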
Another approach: split the columns on "," to form arrays, zip the arrays, and use PySpark's inline function to get what you want:
from pyspark.sql.functions import split, col

df.withColumn('Subject', split(col('Subject'), ',')) \
  .withColumn('Marks', split(col('Marks'), ',')) \
  .selectExpr('ID', 'inline(arrays_zip(Subject, Marks))')
+---+---------+-----+
| ID| Subject|Marks|
+---+---------+-----+
| 1| maths| 80|
| 1| physics| 90|
| 2| Computer| 73|
| 3| music| 76|
| 3| sports| 89|
| 3|chemistry| 85|
+---+---------+-----+
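The pairing that split + arrays_zip + inline performs can be illustrated in plain Python; this is just a sketch of the element-wise zipping logic, not PySpark code:

```python
# Rows matching the input table: (ID, Subject, Marks)
rows = [
    (1, 'maths,physics', '80,90'),
    (2, 'Computer', '73'),
    (3, 'music,sports,chemistry', '76,89,85'),
]

exploded = []
for id_, subjects, marks in rows:
    # split(col, ',') turns each string into an array;
    # arrays_zip pairs the arrays element-wise;
    # inline emits one output row per zipped pair
    for subject, mark in zip(subjects.split(','), marks.split(',')):
        exploded.append((id_, subject, mark))

print(exploded)
```

Because arrays_zip pairs elements positionally, this only works when Subject and Marks contain the same number of comma-separated values per row, as they do in the input here.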