Compute and compare the average of two columns

I started converting my Pandas implementation to PySpark, but I'm running into trouble with some basic operations. I have this table:

+-----+-----+----+
| Col1|Col2 |Col3|
+-----+-----+----+
|  1  |[1,3]|   0|
|  44 |[2,0]|   1|
|  77 |[1,5]|   7|
+-----+-----+----+
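
For reference, the table can be reproduced like this (a sketch assuming Col2 is a Spark ML DenseVector, as the .toArray() calls further down imply; adjust if yours is a plain array column):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Col2 is assumed to be a vector column here
df = spark.createDataFrame(
    [(1, Vectors.dense([1, 3]), 0),
     (44, Vectors.dense([2, 0]), 1),
     (77, Vectors.dense([1, 5]), 7)],
    ['Col1', 'Col2', 'Col3'],
)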

The output I want is:

+-----+-----+----+----+
| Col1|Col2 |Col3|Col4|
+-----+-----+----+----+
|  1  |[1,3]|   0|2.67|
|  44 |[2,0]|   1|2.67|
|  77 |[1,5]|   7|2.67|
+-----+-----+----+----+

How do I get there?

You can use greatest to get the greatest of the per-position averages of the array:

from pyspark.sql import functions as F, Window

# Col2 is a vector column, so convert it to array<double> first to allow indexing
to_array = F.udf(lambda v: [float(x) for x in v.toArray()], 'array<double>')

df2 = df.withColumn(
    'Col4',
    F.greatest(*[F.avg(to_array('Col2')[i]).over(Window.orderBy()) for i in range(2)])
)

df2.show()
+----+------+----+------------------+
|Col1|  Col2|Col3|              Col4|
+----+------+----+------------------+
|   1|[1, 3]|   0|2.6666666666666665|
|  44|[2, 0]|   1|2.6666666666666665|
|  77|[1, 5]|   7|2.6666666666666665|
+----+------+----+------------------+
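
A note on how this works: Window.orderBy() with no columns defines a window with no partitioning and no effective ordering, so the frame spans the whole DataFrame and avg becomes a global mean per array position: (1+2+1)/3 ≈ 1.33 for position 0 and (3+0+5)/3 ≈ 2.67 for position 1; greatest then keeps the larger one on every row. To match the 2.67 shown in the desired output rather than the full-precision double, you can round afterwards:

# round the result to two decimals, as in the desired output
df2 = df2.withColumn('Col4', F.round('Col4', 2))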

If you want the array size to be dynamic, you can do:

# reuse the same UDF to find the largest array size in the column
arr_size = df.select(F.max(F.size(to_array('Col2')))).head()[0]

df2 = df.withColumn(
    'Col4',
    F.greatest(*[F.avg(to_array('Col2')[i]).over(Window.orderBy()) for i in range(arr_size)])
)
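
As a side note, on Spark 3.0+ you can drop the Python UDF entirely and use the built-in pyspark.ml.functions.vector_to_array for the vector-to-array conversion (a sketch, same logic otherwise):

from pyspark.ml.functions import vector_to_array

arr = vector_to_array(F.col('Col2'))  # built-in vector -> array<double> conversion
arr_size = df.select(F.max(F.size(arr))).head()[0]

df2 = df.withColumn(
    'Col4',
    F.greatest(*[F.avg(arr[i]).over(Window.orderBy()) for i in range(arr_size)])
)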