Create a column with value from another column based on condition
I have a dataframe:
+---+---+---+------+
| id|foo|bar|rownum|
+---+---+---+------+
|  1|123|123|     1|
|  2|000|236|     1|
|  2|236|236|     2|
|  2|000|236|     3|
|  3|333|234|     1|
|  3|444|444|     2|
+---+---+---+------+
I want to add a column match that will contain the rownum of the row where foo==bar within each id group, for example:
+---+---+---+------+-----+
| id|foo|bar|rownum|match|
+---+---+---+------+-----+
|  1|123|123|     1|    1|
|  2|000|236|     1|    2|
|  2|236|236|     2|    2|
|  2|000|236|     3|    2|
|  3|333|234|     1|    2|
|  3|444|444|     2|    2|
+---+---+---+------+-----+
I tried this in PySpark:
df_grp2 = df_grp2.withColumn('match', F.when(F.col('foo') == F.col('bar'), F.col('rownum')))
and this in pandas:
df['match'] = df.loc[df['foo'] == df['bar']]['rownum']
but it returns NaN for the rows where they don't match:
+---+---+---+------+-----+
| id|foo|bar|rownum|match|
+---+---+---+------+-----+
|  1|123|123|     1|    1|
|  2|000|236|     1|  NaN|
|  2|236|236|     2|    2|
|  2|000|236|     3|  NaN|
|  3|333|234|     1|  NaN|
|  3|444|444|     2|    2|
+---+---+---+------+-----+
Try a window function: compute the conditional rownum with F.when, then spread the first non-null value across each id group using F.first with ignorenulls over a window partitioned by id.
from pyspark.sql import functions as F, Window as W

df_grp2 = spark.createDataFrame(
    [(1, '123', '123', 1),
     (2, '000', '236', 1),
     (2, '236', '236', 2),
     (2, '000', '236', 3),
     (3, '333', '234', 1),
     (3, '444', '444', 2)],
    ['id', 'foo', 'bar', 'rownum']
)
df_grp2 = df_grp2.withColumn(
    'match',
    # rownum where foo == bar (null otherwise), then the first non-null
    # value per id partition, broadcast to every row of that partition
    F.first(F.when(F.col('foo') == F.col('bar'), F.col('rownum')), ignorenulls=True).over(W.partitionBy('id'))
)
df_grp2.show()
# +---+---+---+------+-----+
# | id|foo|bar|rownum|match|
# +---+---+---+------+-----+
# |  1|123|123|     1|    1|
# |  2|000|236|     1|    2|
# |  2|236|236|     2|    2|
# |  2|000|236|     3|    2|
# |  3|333|234|     1|    2|
# |  3|444|444|     2|    2|
# +---+---+---+------+-----+