Updating a column in PySpark based on the column's current value

Question

Suppose I am given a DataFrame:

+-----+-----+-----+
|    x|    y|    z|
+-----+-----+-----+
|    3|    5|    9|
|    2|    4|    6|
+-----+-----+-----+

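I want to multiply the value in column z by the value in column y, but only in rows where z equals 6.

This post shows the solution I am aiming for, using the code: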

from pyspark.sql import functions as F

df = df.withColumn('z',
    F.when(df['z']==6, df['z']*df['y']).
    otherwise(df['z']))

The problem is that df['z'] and df['y'] are recognized as Column objects, and casting them does not work...

How can I do this correctly?

Answer 1

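from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# Where z equals 6, store z * y (both cast to long); otherwise keep z unchanged.
# Pass 'z' instead of 'new_col' to overwrite the existing column.
df = df.withColumn('new_col',
    F.when(df.z == 6,
        df.z.cast(LongType()) * df.y.cast(LongType())
    ).otherwise(df.z)
)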

