Translate SQL query to PySpark DataFrame query (a Percentile Ranking calculation)

Question

I'm trying to translate this SQL query into PySpark DataFrame methods:

SELECT id_profile, indications, PERCENT_RANK()
OVER (PARTITION BY id_profile ORDER BY prediction DESC) AS rank FROM predictions

Here, id_profile, indications, and prediction are columns in my predictions DataFrame.

I think I have to use a Window for this, but I don't know how.

Answer 1

Try this:

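from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Same window as the SQL OVER clause: one partition per id_profile,
# rows ordered by prediction, highest first.
w = Window.partitionBy("id_profile").orderBy(F.col("prediction").desc())

# df is the predictions DataFrame from the question.
result = (
    df.withColumn("rank", F.percent_rank().over(w))
      .select("id_profile", "indications", "rank")
)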
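If you want to sanity-check the numbers you get back, it may help to recall what PERCENT_RANK computes: (rank - 1) / (rows in partition - 1), where tied rows share the same rank and a single-row partition gets 0. Below is a rough plain-Python sketch of that definition, not how Spark implements it. The helper name percent_rank_desc and the sample rows are made up for illustration, and NULL handling is ignored.

```python
from collections import defaultdict


def percent_rank_desc(rows, partition_key, order_key):
    """Hypothetical helper mimicking
    PERCENT_RANK() OVER (PARTITION BY partition_key ORDER BY order_key DESC).
    Returns copies of the rows with an added "rank" field."""
    # Group rows by the partition column.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    result = []
    for part in partitions.values():
        # ORDER BY ... DESC: highest value first.
        ordered = sorted(part, key=lambda r: r[order_key], reverse=True)
        n = len(ordered)
        rank = 0
        previous = None
        for position, row in enumerate(ordered, start=1):
            # Ties keep the rank of the first row holding that value.
            if position == 1 or row[order_key] != previous:
                rank = position
                previous = row[order_key]
            # A partition with a single row gets 0.0.
            pct = (rank - 1) / (n - 1) if n > 1 else 0.0
            result.append({**row, "rank": pct})
    return result


# Made-up sample data using the column names from the question.
sample_rows = [
    {"id_profile": 1, "indications": "a", "prediction": 0.9},
    {"id_profile": 1, "indications": "b", "prediction": 0.5},
    {"id_profile": 1, "indications": "c", "prediction": 0.5},
    {"id_profile": 1, "indications": "d", "prediction": 0.1},
    {"id_profile": 2, "indications": "e", "prediction": 0.7},
]

for row in percent_rank_desc(sample_rows, "id_profile", "prediction"):
    print(row["id_profile"], row["indications"], row["rank"])
```

Because the ordering is descending, the highest prediction in each id_profile gets 0.0 and the lowest gets 1.0. If you want the opposite, order ascending instead.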


pyspark

pyspark-dataframes