How to iterate over a group and create an array column with Pyspark?
I have a dataframe with groups and percentages:
| Group | A % | B % | Target % |
| ----- | --- | --- | -------- |
| A | .05 | .85 | 1.0 |
| A | .07 | .75 | 1.0 |
| A | .08 | .95 | 1.0 |
| B | .03 | .80 | 1.0 |
| B | .05 | .83 | 1.0 |
| B | .04 | .85 | 1.0 |
Grouped by the `Group` column, I'd like to iterate over the `A %` column and find, for each row, the array of values from the `B %` column that, when summed with that row's `A %` value, are less than or equal to the `Target %` column:
| Group | A % | B % | Target % | SumArray |
| ----- | --- | --- | -------- | ------------ |
| A | .05 | .85 | 1.0 | [.85,.75,.95]|
| A | .07 | .75 | 1.0 | [.85,.75] |
| A | .08 | .95 | 1.0 | [.85,.75] |
| B | .03 | .80 | 1.0 | [.80,.83,.85]|
| B | .05 | .83 | 1.0 | [.80,.83,.85]|
| B | .04 | .85 | 1.0 | [.80,.83,.85]|
For example, in the second row of group A (`A %` = .07), .07 + .95 = 1.02 exceeds the target of 1.0, so .95 is excluded and the array is [.85, .75]. I'd like to solve this with PySpark. Any ideas on how to approach it?
You can use the `collect_list` function to get an array of `B %` column values grouped by the `Group` column, then `filter` the resulting array with your condition `A + B <= Target`; see the code below. Note that it assumes the `%` suffix has been dropped from the column names, as the output shows.
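First, a reproducible setup. This is a minimal sketch that builds the question's sample data; the column names `Group`, `A`, `B`, `Target` (without the `%` suffix) are an assumption taken from the answer's output:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; the "%" suffix is dropped from the
# column names (an assumption based on the answer's output below).
df = spark.createDataFrame(
    [("A", 0.05, 0.85, 1.0), ("A", 0.07, 0.75, 1.0), ("A", 0.08, 0.95, 1.0),
     ("B", 0.03, 0.80, 1.0), ("B", 0.05, 0.83, 1.0), ("B", 0.04, 0.85, 1.0)],
    ["Group", "A", "B", "Target"],
)
```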
```python
from pyspark.sql import Window
import pyspark.sql.functions as F

df2 = df.withColumn(
    # Collect every B value in the row's group into an array.
    "SumArray",
    F.collect_list(F.col("B")).over(Window.partitionBy("Group"))
).withColumn(
    # Keep only the values x for which x + A <= Target holds for this row.
    "SumArray",
    F.expr("filter(SumArray, x -> x + A <= Target)")
)
df2.show()
# +-----+----+----+------+------------------+
# |Group|   A|   B|Target|          SumArray|
# +-----+----+----+------+------------------+
# |    B|0.03| 0.8|   1.0| [0.8, 0.83, 0.85]|
# |    B|0.05|0.83|   1.0| [0.8, 0.83, 0.85]|
# |    B|0.04|0.85|   1.0| [0.8, 0.83, 0.85]|
# |    A|0.05|0.85|   1.0|[0.85, 0.75, 0.95]|
# |    A|0.07|0.75|   1.0|      [0.85, 0.75]|
# |    A|0.08|0.95|   1.0|      [0.85, 0.75]|
# +-----+----+----+------+------------------+
```
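A note on ordering: `collect_list` over a window gives no guarantee about the order of the collected values, so the arrays above may come back in a different order across runs. If a deterministic order matters, `sort_array` can be applied. Here is a sketch of the same logic under that assumption, also using the `pyspark.sql.functions.filter` column function (available in Spark 3.1+) instead of an SQL expression string:

```python
from pyspark.sql import Window
import pyspark.sql.functions as F

df3 = df.withColumn(
    # sort_array makes the result deterministic; collect_list over a
    # window does not guarantee the order of the collected values.
    "SumArray",
    F.sort_array(F.collect_list("B").over(Window.partitionBy("Group")))
).withColumn(
    # Same condition as the expr() version, written as a Python lambda.
    "SumArray",
    F.filter("SumArray", lambda x: x + F.col("A") <= F.col("Target"))
)
df3.show(truncate=False)
```

Both versions produce the same filtered arrays; the lambda form just keeps the condition in Python rather than in an SQL string.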