Pyspark: Is there a function to split dataframe column values on the basis of comma

Input

+--------------+-----------------------+-----------------------+
|ID            |Subject                |Marks                  |
+--------------+-----------------------+-----------------------+
|1             |maths,physics          |80,90                  |
|2             |Computer               |73                     |
|3             |music,sports,chemistry |76,89,85               |
+--------------+-----------------------+-----------------------+

Expected output

+--------------+-----------------------+-----------------------+
|ID            |Subject                |Marks                  |
+--------------+-----------------------+-----------------------+
|1             |maths                  |80                     |
|1             |physics                |90                     |
|2             |Computer               |73                     |
|3             |music                  |76                     |
|3             |sports                 |89                     |
|3             |chemistry              |85                     |
+--------------+-----------------------+-----------------------+

I need help getting the expected output. I have already tried the explode function, but it only works on a single column.

With pandas you can explode multiple columns at once:

cols = ['Subject', 'Marks']
# Split each comma-separated string into a list
df[cols] = df[cols].apply(lambda x: x.str.split(','))
# Explode both list columns together (requires pandas >= 1.3)
df.explode(cols)

Output:

   ID    Subject Marks
0   1      maths    80
0   1    physics    90
1   2   Computer    73
2   3      music    76
2   3     sports    89
2   3  chemistry    85
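
Since the question is about PySpark, note that the snippet above uses the pandas API. Here is a minimal end-to-end sketch with the sample data from the question; converting back to Spark, if needed, assumes an active SparkSession named spark:

import pandas as pd

pdf = pd.DataFrame({
    'ID': [1, 2, 3],
    'Subject': ['maths,physics', 'Computer', 'music,sports,chemistry'],
    'Marks': ['80,90', '73', '76,89,85'],
})

cols = ['Subject', 'Marks']
# Split the comma-separated strings into lists, then explode both columns together
pdf[cols] = pdf[cols].apply(lambda x: x.str.split(','))
result = pdf.explode(cols)  # multi-column explode requires pandas >= 1.3
# sdf = spark.createDataFrame(result)  # back to a Spark DataFrame if needed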

Another approach: split the columns on "," to form arrays, zip the arrays, and use PySpark's inline function to get what you want:

from pyspark.sql.functions import split, col

df.withColumn('Subject', split(col("Subject"), ",")) \
  .withColumn('Marks', split(col("Marks"), ",")) \
  .selectExpr('ID', 'inline(arrays_zip(Subject, Marks))') \
  .show()

+---+---------+-----+
| ID|  Subject|Marks|
+---+---------+-----+
|  1|    maths|   80|
|  1|  physics|   90|
|  2| Computer|   73|
|  3|    music|   76|
|  3|   sports|   89|
|  3|chemistry|   85|
+---+---------+-----+
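
The same idea can be written with the DataFrame API instead of selectExpr. A sketch, assuming Spark 3.x, where arrays_zip keeps the input column names as the struct field names:

from pyspark.sql.functions import split, col, arrays_zip, explode

result = (
    df
    # Turn the comma-separated strings into arrays
    .withColumn('Subject', split(col('Subject'), ','))
    .withColumn('Marks', split(col('Marks'), ','))
    # Zip the two arrays element-wise and explode one struct per row
    .withColumn('zipped', explode(arrays_zip('Subject', 'Marks')))
    # Pull the struct fields back out as top-level columns
    .select('ID',
            col('zipped.Subject').alias('Subject'),
            col('zipped.Marks').alias('Marks'))
)
result.show()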