PySpark: add a new column with the period starting year and handle leap years
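I have a dataframe and I am trying to add a column that contains the period starting date for target_date. However, I get a null value when the start date falls on a leap day. Thanks in advance for your help.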
+-----+----------+----------+------------+-------+-----------------------+
| id|start_date| end_date|target_date_|period_|target_date_fiscal_year|
+-----+----------+----------+------------+-------+-----------------------+
|34667|2017-12-30|2022-12-30| 2021-11-30| 5| 2020|
|47353|2020-02-10|2023-02-10| 2021-11-30| 3| 2021|
|94773|2017-04-15|2022-04-15| 2021-11-30| 5| 2021|
|67324|2017-11-25|2022-11-25| 2021-11-30| 5| 2021|
|45688|2020-02-29|2025-02-28| 2021-11-30| 5| 2021|
+-----+----------+----------+------------+-------+-----------------------+
Expected output:
+-----+----------+----------+------------+-------+-----------------------+--------------------+
| id|start_date| end_date|target_date_|period_|target_date_fiscal_year|period_starting_date|
+-----+----------+----------+------------+-------+-----------------------+--------------------+
|34667|2017-12-30|2022-12-30| 2021-11-30| 5| 2020| 2020-12-30|
|47353|2020-02-10|2023-02-10| 2021-11-30| 3| 2021| 2021-02-10|
|94773|2017-04-15|2022-04-15| 2021-11-30| 5| 2021| 2021-04-15|
|67324|2017-11-25|2022-11-25| 2021-11-30| 5| 2021| 2021-11-25|
|45688|2020-02-29|2025-02-28| 2021-11-30| 5| 2021| 2021-02-28|
+-----+----------+----------+------------+-------+-----------------------+--------------------+
I tried the code below, but it does not produce the correct output:
df.withColumn("period_starting_date", F.concat(F.col('target_date_fiscal_year'),
F.substring(F.col("start_date"), -6, 6)).cast('date')).show()
+-----+----------+----------+------------+-------+-----------------------+--------------------+
| id|start_date| end_date|target_date_|period_|target_date_fiscal_year|period_starting_date|
+-----+----------+----------+------------+-------+-----------------------+--------------------+
|34667|2017-12-30|2022-12-30| 2021-11-30| 5| 2020| 2020-12-30|
|47353|2020-02-10|2023-02-10| 2021-11-30| 3| 2021| 2021-02-10|
|94773|2017-04-15|2022-04-15| 2021-11-30| 5| 2021| 2021-04-15|
|67324|2017-11-25|2022-11-25| 2021-11-30| 5| 2021| 2021-11-25|
|45688|2020-02-29|2025-02-28| 2021-11-30| 5| 2021| null|
+-----+----------+----------+------------+-------+-----------------------+--------------------+
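The null in the last row appears because concatenating the year 2021 with "-02-29" produces the string 2021-02-29, which is not a real calendar date, so the cast has nothing valid to return. The following plain-Python sketch (no Spark involved; `shift_to_year` is a hypothetical helper name used only for illustration) reproduces the failure and shows the clamping rule that the expected output implies:

```python
import calendar
from datetime import date


def shift_to_year(start: date, target_year: int) -> date:
    """Move `start` into `target_year`, clamping the day to the last valid day of the month."""
    last_day = calendar.monthrange(target_year, start.month)[1]
    return start.replace(year=target_year, day=min(start.day, last_day))


# Naively swapping the year fails for a leap-day start date,
# which mirrors the null produced by the string cast.
try:
    date(2020, 2, 29).replace(year=2021)
    naive_ok = True
except ValueError:
    naive_ok = False

print(naive_ok)                                  # False
print(shift_to_year(date(2020, 2, 29), 2021))    # 2021-02-28
print(shift_to_year(date(2017, 12, 30), 2020))   # 2020-12-30
```

Any fix on the Spark side needs to apply the same rule: shift by whole years and fall back to the last day of the month when the original day does not exist.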
Python has a nice package called dateutil that can help you solve this.
Note: you did not include your code, so I cannot check that this is 100% correct.
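from dateutil.relativedelta import relativedelta

def delta_creator(row):
    # `row` is a single row of a pandas DataFrame; start_date must already be a datetime
    delta = int(row['target_date_fiscal_year']) - row['start_date'].year
    # relativedelta clamps Feb 29 to Feb 28 when the target year is not a leap year
    row['period_starting_date'] = row['start_date'] + relativedelta(years=delta)
    return row

# This works on a pandas DataFrame, not on a PySpark DataFrame
df = df.apply(delta_creator, axis=1)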
You can compute the difference between target_date_fiscal_year and the year of start_date, then add the result to start_date to get period_starting_date:
from pyspark.sql import functions as F

df1 = df.withColumn(
    "period_starting_date",
    F.to_date("start_date") + F.format_string(
        "interval %s year", F.col("target_date_fiscal_year") - F.year("start_date")
    ).cast("interval")
)

df1.show()
#+-----+----------+----------+------------+-------+-----------------------+--------------------+
#| id|start_date| end_date|target_date_|period_|target_date_fiscal_year|period_starting_date|
#+-----+----------+----------+------------+-------+-----------------------+--------------------+
#|34667|2017-12-30|2022-12-30| 2021-11-30| 5| 2020| 2020-12-30|
#|47353|2020-02-10|2023-02-10| 2021-11-30| 3| 2021| 2021-02-10|
#|94773|2017-04-15|2022-04-15| 2021-11-30| 5| 2021| 2021-04-15|
#|67324|2017-11-25|2022-11-25| 2021-11-30| 5| 2021| 2021-11-25|
#|45688|2020-02-29|2025-02-28| 2021-11-30| 5| 2021| 2021-02-28|
#+-----+----------+----------+------------+-------+-----------------------+--------------------+