pyspark - Dense rank with ties method "first"

Is there a way to apply a dense rank to a PySpark DataFrame so that, when ties are found, they are ranked in order of first appearance?

This is the same behavior as rank(method='first') in pandas.
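For reference, the pandas behavior in question looks roughly like this. The sample data is an assumption, modeled on the animal example in the pandas rank documentation; the variable name pdf is only illustrative.

```python
import pandas as pd

# Sample data modeled on the pandas rank documentation example (an assumption).
pdf = pd.DataFrame(
    {
        "Animal": ["cat", "penguin", "dog", "spider", "snake"],
        "Number_legs": [4, 2, 4, 8, None],
    }
)

# method="first" breaks ties by order of appearance: cat (row 0) ranks
# ahead of dog (row 2) although both have 4 legs. Missing values stay
# unranked (NaN) under the default na_option="keep".
pdf["rank"] = pdf["Number_legs"].rank(method="first")
print(pdf)
```

Here cat and dog tie on 4 legs, and cat receives the lower rank only because it appears earlier in the frame.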

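Spark's distributed nature means the order in which rows appear cannot be identified implicitly. If your input dataset contains a column such as line_number or row_number, you can use it to reproduce rank(method='first').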
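To see why an explicit ordering column is enough, here is a minimal plain-Python sketch with no Spark involved. The helper rank_first and the sample rows are hypothetical names introduced for illustration; the shuffle merely stands in for the arbitrary row order of a distributed dataset.

```python
import random

# Hypothetical rows: (line_number, animal, number_legs); None marks a missing value.
rows = [
    (1, "cat", 4),
    (2, "penguin", 2),
    (3, "dog", 4),
    (4, "spider", 8),
    (5, "snake", None),
]


def rank_first(rows):
    """Return {animal: rank}, breaking ties by line_number; missing values stay unranked."""
    ranked = sorted(
        (r for r in rows if r[2] is not None),
        key=lambda r: (r[2], r[0]),  # sort by value first, then by line number
    )
    ranks = {animal: i for i, (_, animal, _) in enumerate(ranked, start=1)}
    ranks.update({animal: None for _, animal, legs in rows if legs is None})
    return ranks


shuffled = rows[:]
random.shuffle(shuffled)  # stand-in for an arbitrary row order after a shuffle

# The ranks do not depend on the order in which the rows arrive.
assert rank_first(shuffled) == rank_first(rows)
print(rank_first(shuffled))
```

Because the tie-breaker is stored in the data itself, the result is the same however the rows are distributed or reordered.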

Working example

The following example uses the DataFrame from the pandas rank documentation, with a Line_Number field added to provide an explicit ordering.

The DataFrame is repartitioned to simulate the arbitrary row order you may get after reading data.

import pyspark.sql.functions as F
from pyspark.sql import Window

data = [{"Line_Number": 1, "Animal": "cat", "Number_legs": 4}, {"Line_Number": 2, "Animal": "penguin", "Number_legs": 2},
        {"Line_Number": 3, "Animal": "dog", "Number_legs": 4}, {"Line_Number": 4, "Animal": "spider", "Number_legs": 8},
        {"Line_Number": 5, "Animal": "snake", "Number_legs": None}]

df = spark.createDataFrame(data).repartition(8)


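# Order by Number_legs (nulls last), then by Line_Number so ties keep their order of appearance.
window_spec = Window.orderBy(F.col("Number_legs").asc_nulls_last(), F.col("Line_Number"))

# row_number() assigns consecutive ranks; rows with a null Number_legs get a null rank, as in pandas.
df.withColumn("rank", F.when(F.col("Number_legs").isNull(), F.lit(None)).otherwise(F.row_number().over(window_spec))).show()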

Output

+-------+-----------+-----------+----+
| Animal|Line_Number|Number_legs|rank|
+-------+-----------+-----------+----+
|penguin|          2|          2|   1|
|    cat|          1|          4|   2|
|    dog|          3|          4|   3|
| spider|          4|          8|   4|
|  snake|          5|       null|null|
+-------+-----------+-----------+----+