Assign date values for null in a column in a pyspark dataframe

Question

I have a pyspark dataframe:

Location        Month       New_Date    Sales
USA             1/1/2020    1/1/2020    34.56%
COL             1/1/2020    1/1/2020    66.4%
AUS             1/1/2020    1/1/2020    32.98%
NZ              null        null        44.59%
CHN             null        null        21.13%

I am creating the New_Date column from the Month column (MM/dd/yyyy format). I need to fill in the New_Date value for the rows where Month is null.

This is what I tried:

from pyspark.sql.functions import col, current_date, trunc

df1=df.filter(col('Month').isNull()) \
.withColumn("current_date",current_date()) \
.withColumn("New_date", trunc(col("current_date"), "month"))

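But this gives me the first day of the current month. I need the first date from the Month column instead. Please suggest any other approach. Expected output: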

Location        Month       New_Date    Sales
USA             1/1/2020    1/1/2020    34.56%
COL             1/1/2020    1/1/2020    66.4%
AUS             1/1/2020    1/1/2020    32.98%
NZ              null        1/1/2020    44.59%
CHN             null        1/1/2020    21.13%

Answer 1

You can use the first function over a window:

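from pyspark.sql import functions as F, Window

# The frame spans the whole dataframe, so every row sees the same value.
# Note: with no partitionBy, Spark moves all rows into a single partition.
w = (Window.orderBy("Month")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
     )

# Keep Month where it is set; otherwise use the first non-null Month in the window.
df1 = df.withColumn(
    "New_date",
    F.coalesce(F.col("Month"), F.first("Month", ignorenulls=True).over(w))
)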

df1.show()
#+--------+--------+--------+------+
#|Location|   Month|New_date| Sales|
#+--------+--------+--------+------+
#|      NZ|    null|1/1/2020|44.59%|
#|     CHN|    null|1/1/2020|21.13%|
#|     USA|1/1/2020|1/1/2020|34.56%|
#|     COL|1/1/2020|1/1/2020| 66.4%|
#|     AUS|1/1/2020|1/1/2020|32.98%|
#+--------+--------+--------+------+
