How to count frequency of elements from a column of lists in a pyspark dataframe?
I have a pyspark dataframe as shown below:
data2 = [("James",["A x","B z","C q","D", "E"]),
("Michael",["A x","C","E","K", "D"]),
("Robert",["A y","R","B z","B","D"]),
("Maria",["X","A y","B z","F","B"]),
("Jen",["A","B","C q","F","R"])
]
df2 = spark.createDataFrame(data2, ["Name", "My_list" ])
df2
Name My_list
0 James [A x, B z, C q, D, E]
1 Michael [A x, C, E, K, D]
2 Robert [A y, R, B z, B, D]
3 Maria [X, A y, B z, F, B]
4 Jen [A, B, C q, F, R]
I would like to count the elements of the 'My_list' column and sort the counts in descending order. For example,
'A x' appeared -> P times,
'B z' appeared -> Q times, and so on.
Could someone shed some light on this? Thank you very much.
The following command explodes the array and gives the count of each element:
import pyspark.sql.functions as F

df_ans = (df2
    .withColumn("explode", F.explode("My_list"))
    .groupBy("explode")
    .count()
    .orderBy(F.desc("count"))
)
df_ans.show()
The result is one row per distinct element with its count, sorted in descending order of frequency.
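As a quick sanity check that does not require a Spark session, the same frequencies can be reproduced with Python's built-in `collections.Counter` on the raw `data2` list. This is a minimal sketch mirroring what `explode` + `groupBy` + `count` does, not part of the original answer:

```python
from collections import Counter

# Same sample data as in the question
data2 = [
    ("James", ["A x", "B z", "C q", "D", "E"]),
    ("Michael", ["A x", "C", "E", "K", "D"]),
    ("Robert", ["A y", "R", "B z", "B", "D"]),
    ("Maria", ["X", "A y", "B z", "F", "B"]),
    ("Jen", ["A", "B", "C q", "F", "R"]),
]

# Flatten every My_list and count each element,
# which is what explode + groupBy("explode").count() computes in Spark
counts = Counter(elem for _name, my_list in data2 for elem in my_list)

# most_common() sorts by count descending, like orderBy(F.desc("count"))
for elem, n in counts.most_common():
    print(elem, n)
```

On this sample, 'B z', 'D', and 'B' each appear 3 times, while 'A x' appears twice; ties within the same count may print in any order, just as their row order in Spark is not guaranteed.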