How to create multiple count columns in Pyspark?

I have a dataframe with title and bin columns:

+---------------------+-------------+
|                Title|          bin|
+---------------------+-------------+
|  Forrest Gump (1994)|            3|
|  Pulp Fiction (1994)|            2|
|   Matrix, The (1999)|            3|
|     Toy Story (1995)|            1|
|    Fight Club (1999)|            3|
+---------------------+-------------+

Using Pyspark, how can I count each bin into its own column of a new dataframe? For example:

+------------+------------+------------+
| count(bin1)| count(bin2)| count(bin3)|
+------------+------------+------------+
|           1|           1|           3|
+------------+------------+------------+

Is this possible? If so, could someone help me do it?

Group by bin and count, then pivot the bin column and rename the columns of the resulting dataframe as needed (the full chain is shown below).
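
First, a minimal sketch that rebuilds the sample dataframe; the SparkSession setup here is an assumption for reproducibility, not part of the original question:

from pyspark.sql import SparkSession

# Assumed setup: a local session and the sample data from the question
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("Forrest Gump (1994)", 3),
        ("Pulp Fiction (1994)", 2),
        ("Matrix, The (1999)", 3),
        ("Toy Story (1995)", 1),
        ("Fight Club (1999)", 3),
    ],
    ["Title", "bin"],
)

With df in place, the group-count-pivot-rename chain is: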

import pyspark.sql.functions as F

# Count rows per bin, then pivot the bin values into columns;
# F.first keeps each bin's single count as the cell value
df1 = df.groupBy("bin").count().groupBy().pivot("bin").agg(F.first("count"))

# Rename the pivoted columns ("1", "2", "3") to count_bin1, count_bin2, ...
df1 = df1.toDF(*[f"count_bin{c}" for c in df1.columns])

df1.show()

#+----------+----------+----------+
#|count_bin1|count_bin2|count_bin3|
#+----------+----------+----------+
#|         1|         1|         3|
#+----------+----------+----------+
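
As a design note, the same result can be computed in a single pass with conditional aggregation instead of group-count-then-pivot. The sketch below is an equivalent alternative, not the method above, and the bin values 1, 2, 3 are hard-coded from the sample data:

# Alternative sketch: count each bin value directly with F.count + F.when.
# Bin values are hard-coded from the sample; collect them from the data
# (df.select("bin").distinct()) if they are not known in advance.
df2 = df.agg(
    *[
        F.count(F.when(F.col("bin") == b, True)).alias(f"count_bin{b}")
        for b in (1, 2, 3)
    ]
)
df2.show()

#+----------+----------+----------+
#|count_bin1|count_bin2|count_bin3|
#+----------+----------+----------+
#|         1|         1|         3|
#+----------+----------+----------+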