PySpark: add a new column with the period starting year and handle leap years

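I have a DataFrame and I am trying to add a column containing the start date of the period that target_date_ falls in. However, for start dates that fall on a leap day (February 29) I get null values. Any help is appreciated.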

+-----+----------+----------+------------+-------+-----------------------+
|   id|start_date|  end_date|target_date_|period_|target_date_fiscal_year|
+-----+----------+----------+------------+-------+-----------------------+
|34667|2017-12-30|2022-12-30|  2021-11-30|      5|                   2020|
|47353|2020-02-10|2023-02-10|  2021-11-30|      3|                   2021|
|94773|2017-04-15|2022-04-15|  2021-11-30|      5|                   2021|
|67324|2017-11-25|2022-11-25|  2021-11-30|      5|                   2021|
|45688|2020-02-29|2025-02-28|  2021-11-30|      5|                   2021|
+-----+----------+----------+------------+-------+-----------------------+

Expected output:

+-----+----------+----------+------------+-------+-----------------------+--------------------+
|   id|start_date|  end_date|target_date_|period_|target_date_fiscal_year|period_starting_date| 
+-----+----------+----------+------------+-------+-----------------------+--------------------+
|34667|2017-12-30|2022-12-30|  2021-11-30|      5|                   2020|          2020-12-30|
|47353|2020-02-10|2023-02-10|  2021-11-30|      3|                   2021|          2021-02-10|
|94773|2017-04-15|2022-04-15|  2021-11-30|      5|                   2021|          2021-04-15|
|67324|2017-11-25|2022-11-25|  2021-11-30|      5|                   2021|          2021-11-25|
|45688|2020-02-29|2025-02-28|  2021-11-30|      5|                   2021|          2021-02-28|
+-----+----------+----------+------------+-------+-----------------------+--------------------+

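I tried the code below, but it does not give the correct output: the row whose start_date is 2020-02-29 comes back as null.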

df.withColumn("period_starting_date", F.concat(F.col('target_date_fiscal_year'),
 F.substring(F.col("start_date"), -6, 6)).cast('date')).show()
+-----+----------+----------+------------+-------+-----------------------+--------------------+
|   id|start_date|  end_date|target_date_|period_|target_date_fiscal_year|period_starting_date| 
+-----+----------+----------+------------+-------+-----------------------+--------------------+
|34667|2017-12-30|2022-12-30|  2021-11-30|      5|                   2020|          2020-12-30|
|47353|2020-02-10|2023-02-10|  2021-11-30|      3|                   2021|          2021-02-10|
|94773|2017-04-15|2022-04-15|  2021-11-30|      5|                   2021|          2021-04-15|
|67324|2017-11-25|2022-11-25|  2021-11-30|      5|                   2021|          2021-11-25|
|45688|2020-02-29|2025-02-28|  2021-11-30|      5|                   2021|                null|
+-----+----------+----------+------------+-------+-----------------------+--------------------+

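Python has a nice package called dateutil that can help you solve this.

Note: you did not include code to build the DataFrame, so I could not check that this is 100% correct. It also works on a pandas DataFrame, not directly on a Spark DataFrame.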

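from dateutil.relativedelta import relativedelta

# Assumes a pandas DataFrame whose start_date column holds datetime values.
def delta_creator(row):
    # With axis=1 each row is a Series, so start_date is a scalar Timestamp:
    # use .year here, not the column-level .dt.year accessor.
    delta = int(row['target_date_fiscal_year']) - row['start_date'].year
    # relativedelta clamps Feb 29 to Feb 28 when the target year is not a leap year.
    row['period_starting_date'] = row['start_date'] + relativedelta(years=delta)
    return row

df = df.apply(delta_creator, axis=1)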
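
For reference, the null in the question comes from the string concatenation producing 2021-02-29, which is not a valid date. The clamping rule that avoids this can be sketched with the standard library alone. This is only an illustration of the idea, not the dateutil implementation; the helper name shift_to_year is made up for this example.

```python
import calendar
from datetime import date


def shift_to_year(start: date, target_year: int) -> date:
    """Move `start` into `target_year`, clamping the day to the month's last day."""
    # monthrange returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(target_year, start.month)[1]
    # Replace year and day together so Feb 29 never lands in a non-leap year
    return start.replace(year=target_year, day=min(start.day, last_day))


# Two rows taken from the question: (start_date, target_date_fiscal_year)
rows = [
    (date(2017, 12, 30), 2020),
    (date(2020, 2, 29), 2021),
]
for start, fiscal_year in rows:
    print(start, "->", shift_to_year(start, fiscal_year))
```

Only February 29 can ever be affected when shifting by whole years, since every other month has the same length in every year.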

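You can compute the difference between target_date_fiscal_year and the year of start_date, then add that many years to start_date to get period_starting_date. Adding an interval, rather than concatenating strings, lets Spark adjust February 29 to February 28 in non-leap years: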

from pyspark.sql import functions as F

df1 = df.withColumn(
    "period_starting_date",
    F.to_date("start_date") + F.format_string(
        "interval %s year", F.col("target_date_fiscal_year") - F.year("start_date")
    ).cast("interval")
)

df1.show()

#+-----+----------+----------+------------+-------+-----------------------+--------------------+
#|   id|start_date|  end_date|target_date_|period_|target_date_fiscal_year|period_starting_date|
#+-----+----------+----------+------------+-------+-----------------------+--------------------+
#|34667|2017-12-30|2022-12-30|  2021-11-30|      5|                   2020|          2020-12-30|
#|47353|2020-02-10|2023-02-10|  2021-11-30|      3|                   2021|          2021-02-10|
#|94773|2017-04-15|2022-04-15|  2021-11-30|      5|                   2021|          2021-04-15|
#|67324|2017-11-25|2022-11-25|  2021-11-30|      5|                   2021|          2021-11-25|
#|45688|2020-02-29|2025-02-28|  2021-11-30|      5|                   2021|          2021-02-28|
#+-----+----------+----------+------------+-------+-----------------------+--------------------+