Looking to create column of "rank arrays" based on another column of Array(Float) type

Question

Here is my dataset:

score  
[0.3, 0.5]
[0.1, 0.6, 0.7]

Desired dataset:

score            rank 
[0.3, 0.5]      [1, 2]
[0.1, 0.6, 0.7] [1, 2, 3]

Here is my initial attempt:

df_upd = df.withColumn("rank", F.array([F.lit(i) for i in range(1, F.size("score") + 1)]))

I get this error:

TypeError: range() integer end argument expected, got Column.

I'm wondering whether there is a concise way to do this, or whether I have to explode the df and then use Window functions to create the rank column.

Answer 1

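It looks like you just want to create a sequence from 1 to size(score). You can use the sequence function for that. Your attempt fails because F.size("score") returns a Column that is evaluated per row by Spark, while Python's range() needs a plain integer up front: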

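from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the active session (e.g. in a shell or notebook) or start a new one
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([([0.3, 0.5],), ([0.1, 0.6, 0.7],)], ["score"])

# sequence(1, n) builds the array [1, 2, ..., n]; size(score) is the array length
df.withColumn("rank", F.expr("sequence(1, size(score))")).show()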

#+---------------+---------+
#|          score|     rank|
#+---------------+---------+
#|     [0.3, 0.5]|   [1, 2]|
#|[0.1, 0.6, 0.7]|[1, 2, 3]|
#+---------------+---------+
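
One caveat: sequence(1, size(score)) only numbers the positions, so it matches a real rank only when each array is already sorted in ascending order, as in your sample. The plain-Python sketch below (no Spark needed) models the difference per row. The helper names positional_rank and value_rank are made up for illustration, and the third sample row is an extra unsorted case that is not in your data.

```python
def positional_rank(scores):
    """Model of sequence(1, size(score)) for a non-empty array: 1..len(scores)."""
    return list(range(1, len(scores) + 1))


def value_rank(scores):
    """Rank each element by value (1 = smallest); ties share the lowest rank."""
    ordered = sorted(scores)
    # list.index returns the first match, so equal values get the same rank
    return [ordered.index(x) + 1 for x in scores]


# The first two rows come from the question; the third is a hypothetical unsorted row
rows = [[0.3, 0.5], [0.1, 0.6, 0.7], [0.7, 0.1, 0.6]]
for scores in rows:
    print(scores, positional_rank(scores), value_rank(scores))
```

If your arrays can be unsorted, I believe the same idea can be written in Spark SQL as F.expr("transform(score, x -> array_position(array_sort(score), x))"), but treat that as an untested sketch. Also note that, as far as I know, sequence(1, 0) counts down and yields [1, 0] rather than an empty array, so empty score arrays may need separate handling.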

