PySpark: creating aggregated columns from the distinct values of a string column
I have this DataFrame:
+---------+--------+------+
| topic| emotion|counts|
+---------+--------+------+
| dog | sadness| 4 |
| cat |surprise| 1 |
| bird | fear| 3 |
| cat | joy| 2 |
| dog |surprise| 10 |
| dog |surprise| 3 |
+---------+--------+------+
I want to create one column per distinct emotion, holding the sum of counts for each topic and emotion, so that the output looks like this:
+---------+--------+---------+-----+----------+
| topic| fear | sadness | joy | surprise |
+---------+--------+---------+-----+----------+
| dog | 0 | 4 | 0 | 13 |
| cat | 0 | 0 | 2 | 1 |
| bird | 3 | 0 | 0 | 0 |
+---------+--------+---------+-----+----------+
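This is what I have tried so far, for the fear column only, but the other emotions keep showing up as separate rows for every topic. How can I get a result like the one above?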
agg_emotion = df.groupby("topic", "emotion") \
.agg(F.sum(F.when(F.col("emotion").eqNullSafe("fear"), 1)\
.otherwise(0)).alias('fear'))
Group by topic and emotion and sum the counts, then group by topic again and pivot the result on emotion:
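from pyspark.sql import functions as F

# F.sum is Spark's column aggregate; a bare sum() would resolve to Python's built-in and fail
df.groupby('topic', 'emotion').agg(F.sum('counts').alias('counts')) \
    .groupby('topic').pivot('emotion').agg(F.first('counts')) \
    .na.fill(0).show()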
+-----+----+---+-------+--------+
|topic|fear|joy|sadness|surprise|
+-----+----+---+-------+--------+
| dog| 0| 0| 4| 13|
| cat| 0| 2| 0| 1|
| bird| 3| 0| 0| 0|
+-----+----+---+-------+--------+