Sorting by count using pyspark

I'm trying to print the top 11 states, the largest city in each state, and the number of businesses in each state. For some reason I can't get the business count per state, only the count per city.

Here is the code I'm having trouble with:

# Note: .show() returns None, so assigning its result to dun keeps nothing.
dun = (df_busSelected.groupBy("state", "city")
       .count()
       .orderBy("count", ascending=False)
       .limit(11))
dun.show(truncate=False)

+-----+----------+-----+
|state|city      |count|
+-----+----------+-----+
|NV   |Las Vegas |29361|
|ON   |Toronto   |18904|
|AZ   |Phoenix   |18764|
|NC   |Charlotte |9507 |
|AZ   |Scottsdale|8837 |
|AB   |Calgary   |7735 |
|PA   |Pittsburgh|7016 |
|QC   |Montréal  |6449 |
|AZ   |Mesa      |6080 |
|NV   |Henderson |4892 |
|AZ   |Tempe     |4550 |
+-----+----------+-----+

If I understand correctly what you need to do:

from pyspark.sql.functions import col, struct, sort_array, collect_set

df_busSelected = spark.createDataFrame([
    ("NV", "Las Vegas", 29361), ("ON", "Toronto", 18904),
    ("AZ", "Phoenix", 18764), ("NC", "Charlotte", 9507),
    ("AZ", "Scottsdale", 8837), ("AB", "Calgary", 7735),
    ("PA", "Pittsburgh", 7016), ("QC", "Montréal", 6449),
    ("AZ", "Mesa", 6080), ("NV", "Henderson", 4892),
    ("AZ", "Tempe", 4550)
]).toDF("state", "city", "count")

# Pack (count, city) into a struct so that sorting the collected set
# orders the cities by count; with descending order, element [0] is
# the city with the highest count in each state.
df_busSelected.withColumn("city_total_business", struct(col("count"), col("city")))\
    .groupBy("state")\
    .agg(sort_array(collect_set(col("city_total_business")), False)[0].alias("top_city"))\
    .withColumn("city", col("top_city").getItem("city"))\
    .withColumn("count", col("top_city").getItem("count"))\
    .drop("top_city")\
    .show()

which prints:

+-----+----------+-----+
|state|      city|count|
+-----+----------+-----+
|   AZ|   Phoenix|18764|
|   QC|  Montréal| 6449|
|   NV| Las Vegas|29361|
|   NC| Charlotte| 9507|
|   PA|Pittsburgh| 7016|
|   ON|   Toronto|18904|
|   AB|   Calgary| 7735|
+-----+----------+-----+

This returns the city with the highest count for each state. From here it's easy to sort them and do whatever else you need.

Please upvote my answer if it helped.