Score-wise ranking in PySpark
My Spark data looks like this:
area product score
a    aa      .39
a    bb      .03
a    cc      1.1
a    dd      .5
b    ee      .02
b    aa      1.2
b    mm      .5
b    bb      1.3
I want the top 3 products per area, ranked by the score variable. My final output should be:
area product score rank
a    cc      1.1   1
a    dd      .5    2
a    aa      .39   3
b    bb      1.3   1
b    aa      1.2   2
b    mm      .5    3
How can I achieve this in PySpark?
What I have done so far:
from pyspark.sql import Window
import pyspark.sql.functions as psf
wA = Window.orderBy(psf.desc("score"))
df = df.withColumn(
    "rank",
    psf.dense_rank().over(wA))
But it doesn't work for me.
Partitioning by area and filtering on rank <= 3 gives the result:
import pyspark.sql.functions as psf
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
spark = SparkSession.builder.appName("Test").master("local[*]") \
    .getOrCreate()
df = spark.createDataFrame([('a', 'aa', .39),
                            ('a', 'bb', .03),
                            ('a', 'cc', 1.1),
                            ('a', 'dd', .5),
                            ('b', 'ee', .02),
                            ('b', 'aa', 1.2),
                            ('b', 'mm', .5),
                            ('b', 'bb', 1.3)],
                           ['area', 'product', 'score'])
wA = Window.partitionBy("area").orderBy(psf.desc("score"))
df = df.withColumn("rank",
                   psf.dense_rank().over(wA))
df.filter("rank<=3").show()