Using PySpark groupBy and orderBy together

Question

Hello, I want to achieve the equivalent of the following:

SAS SQL: select * from flightData2015 group by DEST_COUNTRY_NAME order by count

My data looks like this:

Here is my Spark code:

flightData2015.selectExpr("*").groupBy("DEST_COUNTRY_NAME").orderBy("count").show()

I get this error:

AttributeError: 'GroupedData' object has no attribute 'orderBy'. I am new to PySpark. Do groupBy and orderBy in PySpark work differently from SAS SQL?

I also tried sort, flightData2015.selectExpr("*").groupBy("DEST_COUNTRY_NAME").sort("count").show(), but I got a similar error: "AttributeError: 'GroupedData' object has no attribute 'sort'". Please help!

Answer 1

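In Spark, groupBy returns a GroupedData object, not a DataFrame, and GroupedData has no orderBy or sort method. A groupBy is normally followed by an aggregation, which turns the result back into a DataFrame. So even though the SAS SQL query has no aggregation, you still have to define one here (you can drop the resulting column afterwards if you don't need it).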

(flightData2015
    .groupBy("DEST_COUNTRY_NAME")
    .count() # this is the "dummy" aggregation
    .orderBy("count")
    .show()
)

Answer 2

If you want every row, you don't need a group by. You can simply sort by multiple columns.

from pyspark.sql import functions as F

# Assumes an existing SparkSession named `spark` (e.g. in a notebook)
vals = [
    ("United States", "Angola", 13),
    ("United States", "Anguilla", 38),
    ("United States", "Antigua", 20),
    ("United Kingdom", "Antigua", 22),
    ("United Kingdom", "Peru", 50),
    ("United Kingdom", "Russia", 13),
    ("Argentina", "United Kingdom", 13),
]
cols = ["destination_country_name", "origin_country_name", "count"]
df = spark.createDataFrame(vals, cols)

# If you want count to be descending:
# display(df.orderBy(['destination_country_name', F.col('count').desc()]))

# display() is a notebook helper (e.g. Databricks); elsewhere use .show()
display(df.orderBy(['destination_country_name', 'count']))
