Sorting by count using pyspark
I am trying to print the top 11 states, the largest city in each state, and the number of businesses in each state. For some reason I can't get the business count per state, only the count per city.
Here is the code I'm having trouble with:
dun=df_busSelected.groupBy("state","city").count().orderBy("count",ascending=False).limit(11).show(truncate=False)
+-----+----------+-----+
|state|city |count|
+-----+----------+-----+
|NV |Las Vegas |29361|
|ON |Toronto |18904|
|AZ |Phoenix |18764|
|NC |Charlotte |9507 |
|AZ |Scottsdale|8837 |
|AB |Calgary |7735 |
|PA |Pittsburgh|7016 |
|QC |Montréal |6449 |
|AZ |Mesa |6080 |
|NV |Henderson |4892 |
|AZ |Tempe |4550 |
+-----+----------+-----+
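(As a side note: the per-state business count on its own is just a groupBy on state alone. A minimal sketch, assuming each row of df_busSelected is one business:)
df_busSelected.groupBy("state").count().orderBy("count", ascending=False).limit(11).show()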
If I understand correctly what you need to do:
from pyspark.sql.functions import col, struct, sort_array, collect_set

# Recreate the aggregated (state, city, count) table from the question
df_busSelected = spark.createDataFrame(
    [("NV", "Las Vegas", 29361), ("ON", "Toronto", 18904), ("AZ", "Phoenix", 18764),
     ("NC", "Charlotte", 9507), ("AZ", "Scottsdale", 8837), ("AB", "Calgary", 7735),
     ("PA", "Pittsburgh", 7016), ("QC", "Montréal", 6449), ("AZ", "Mesa", 6080),
     ("NV", "Henderson", 4892), ("AZ", "Tempe", 4550)]).toDF("state", "city", "count")

# Pack (count, city) into a struct; sorting the collected structs in descending
# order puts the city with the highest count first, so element [0] is the winner.
df_busSelected.withColumn("city_total_business", struct(col("count"), col("city")))\
    .groupBy("state")\
    .agg(sort_array(collect_set(col("city_total_business")), False)[0].alias("top_city"))\
    .withColumn("city", col("top_city").getItem("city"))\
    .withColumn("count", col("top_city").getItem("count"))\
    .drop("top_city")\
    .show()
This prints:
+-----+----------+-----+
|state| city|count|
+-----+----------+-----+
| AZ| Phoenix|18764|
| QC| Montréal| 6449|
| NV| Las Vegas|29361|
| NC| Charlotte| 9507|
| PA|Pittsburgh| 7016|
| ON| Toronto|18904|
| AB| Calgary| 7735|
+-----+----------+-----+
This returns the city with the highest count for each state. From here it is easy to sort the result and take whatever you need.
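For instance, to get the top 11 states together with each state's total business count and its largest city, one option is a window over state. This is just a sketch, assuming the aggregated df_busSelected above; the state_total and rank column names are made up for illustration:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number
from pyspark.sql.functions import sum as sum_

# Total businesses per state, and cities ranked by count within each state
state_window = Window.partitionBy("state")
city_rank = Window.partitionBy("state").orderBy(col("count").desc())

(df_busSelected
    .withColumn("state_total", sum_("count").over(state_window))
    .withColumn("rank", row_number().over(city_rank))
    .filter(col("rank") == 1)                     # keep only each state's largest city
    .select("state", "city", "count", "state_total")
    .orderBy(col("state_total").desc())           # biggest states first
    .limit(11)
    .show())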
Please rate my answer if you found it helpful.