Create a column with value from another column based on condition

I have a dataframe:

+---+---+---+------+
| id|foo|bar|rownum|
+---+---+---+------+
|  1|123|123|     1|
|  2|000|236|     1|
|  2|236|236|     2|
|  2|000|236|     3|
|  3|333|234|     1|
|  3|444|444|     2|
+---+---+---+------+

I want to add a column match that will contain the rownum where foo==bar, e.g.:

+---+---+---+------+-----+
| id|foo|bar|rownum|match|
+---+---+---+------+-----+
|  1|123|123|     1|    1|
|  2|000|236|     1|    2|
|  2|236|236|     2|    2|
|  2|000|236|     3|    2|
|  3|333|234|     1|    2|
|  3|444|444|     2|    2|
+---+---+---+------+-----+

I tried this:

df_grp2 = df_grp2.withColumn('match', F.when(F.col('foo') == F.col('bar'), F.col('rownum')))

And using this:

df['match'] = df.loc[df['foo'] == df['bar']]['rownum']

But it returns 'NaN' when they don't match:

+---+---+---+------+-----+
| id|foo|bar|rownum|match|
+---+---+---+------+-----+
|  1|123|123|     1|    1|
|  2|000|236|     1|  NaN|
|  2|236|236|     2|    2|
|  2|000|236|     3|  NaN|
|  3|333|234|     1|  NaN|
|  3|444|444|     2|    2|
+---+---+---+------+-----+

Try using a window function. `F.first(..., True)` (the second argument is `ignorenulls`) takes the first non-null `rownum` where `foo == bar` within each `id` partition and broadcasts it to every row of that partition.

from pyspark.sql import functions as F, Window as W

df_grp2 = spark.createDataFrame(
    [(1, '123', '123', 1),
     (2, '000', '236', 1),
     (2, '236', '236', 2),
     (2, '000', '236', 3),
     (3, '333', '234', 1),
     (3, '444', '444', 2)],
    ['id', 'foo', 'bar', 'rownum']
)

df_grp2 = df_grp2.withColumn(
    'match',
    F.first(F.when(F.col('foo') == F.col('bar'), F.col('rownum')), True).over(W.partitionBy('id'))
)

df_grp2.show()
# +---+---+---+------+-----+
# | id|foo|bar|rownum|match|
# +---+---+---+------+-----+
# |  1|123|123|     1|    1|
# |  2|000|236|     1|    2|
# |  2|236|236|     2|    2|
# |  2|000|236|     3|    2|
# |  3|333|234|     1|    2|
# |  3|444|444|     2|    2|
# +---+---+---+------+-----+
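Since the question also shows a pandas attempt, here is one way to get the same result in pandas (a sketch, assuming every `id` group has at least one matching row, otherwise the final `astype(int)` would fail on remaining NaNs):

```python
import pandas as pd

# Sample data matching the question
df = pd.DataFrame({
    'id':     [1, 2, 2, 2, 3, 3],
    'foo':    ['123', '000', '236', '000', '333', '444'],
    'bar':    ['123', '236', '236', '236', '234', '444'],
    'rownum': [1, 1, 2, 3, 1, 2],
})

# Keep rownum only where foo == bar (NaN elsewhere), then broadcast
# the first non-null value across each id group with transform('first')
df['match'] = (
    df['rownum']
      .where(df['foo'] == df['bar'])
      .groupby(df['id'])
      .transform('first')
      .astype(int)
)

print(df['match'].tolist())  # [1, 2, 2, 2, 2, 2]
```

`GroupBy.first` skips NaN values, so it plays the same role here as `F.first(..., ignorenulls=True)` does in the Spark answer.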