Spark: Cannot create new column from the output of filling one column's null values from another

Question

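I am trying to fill the null values in ColY with the values from ColX, while storing the output as a new column, Col_new, in my DataFrame. I am using pyspark in Databricks, but I am fairly new to it.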

Sample data is as follows:

ColX              ColY  
apple             orange
pear              null
grapefruit        pear
apple             null

The desired output looks like this:

ColX              ColY              Col_new
apple             orange            orange  
pear              null              pear
grapefruit        pear              pear
apple             null              apple

I have tried several lines of code, but none of them worked. My latest attempt was:

.withColumn("Col_new", col('ColX').select(coalesce('ColY')))

Any help would be greatly appreciated. Many thanks.

Answer 1

Both columns, ColY and ColX, should be passed as arguments to coalesce:

df = spark.createDataFrame([
  ("apple", "orange"),
  ("pear", None),
  ("grapefruit", "pear"),
  ("apple", None)
]).toDF("ColX", "ColY")

from pyspark.sql.functions import coalesce

df.withColumn("ColNew", coalesce("ColY", "ColX")).show()
+----------+------+------+
|      ColX|  ColY|ColNew|
+----------+------+------+
|     apple|orange|orange|
|      pear|  null|  pear|
|grapefruit|  pear|  pear|
|     apple|  null| apple|
+----------+------+------+

Answer 2

coalesce returns the first non-null value from the list of columns it is given. You are passing only one column, so coalesce has no effect.

The correct syntax in this case is:

from pyspark.sql.functions import coalesce
df = df.withColumn("Col_new", coalesce('ColY', 'ColX'))

This means: take the value of ColY unless it is null, in which case take the value of ColX.

In this case, you could also express the equivalent logic with when:

from pyspark.sql.functions import col, when

df = df.withColumn(
    "Col_new", 
    when(col("ColY").isNull(), col("ColX")).otherwise(col("ColY"))
)
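
To see why the two expressions agree, here is a minimal sketch that models them row by row in plain Python. It is not Spark code: `coalesce_values` and `when_otherwise` are hypothetical helper names made up for illustration, and `None` stands in for a Spark null.

```python
# Plain-Python model of the two column expressions; this does not use Spark.
# Rows are (ColX, ColY) pairs taken from the sample data in the question.
rows = [
    ("apple", "orange"),
    ("pear", None),
    ("grapefruit", "pear"),
    ("apple", None),
]


def coalesce_values(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


def when_otherwise(col_x, col_y):
    """Mirror when(ColY is null, ColX).otherwise(ColY) for a single row."""
    return col_x if col_y is None else col_y


# Argument order matters: ColY is checked first, ColX is the fallback.
via_coalesce = [coalesce_values(col_y, col_x) for col_x, col_y in rows]
via_when = [when_otherwise(col_x, col_y) for col_x, col_y in rows]

print(via_coalesce)
print(via_when)
```

Both lists match the desired Col_new values. The coalesce form is shorter and extends naturally to more than two fallback columns, while the when form is easier to adapt if the condition is something other than a null check.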

apache-spark

pyspark

databricks