to_date gives null on format yyyyww (202001 and 202053)
I have a dataframe with a year-week column that I want to convert to a date. The code I wrote seems to work for every week except "202001" and "202053", for example:
import pyspark.sql.functions as F

df = spark.createDataFrame([
    (1, "202001"),
    (2, "202002"),
    (3, "202003"),
    (4, "202052"),
    (5, "202053")
], ['id', 'week_year'])
df.withColumn("date", F.to_date(F.col("week_year"), "yyyyw")).show()
I don't know what the error is for these weeks or how to fix it. How can I convert weeks 202001 and 202053 into valid dates?
Dealing with ISO weeks in Spark is a real headache - in fact, this functionality was deprecated (removed?) in Spark 3. I think using Python's datetime utilities inside a UDF is a more flexible way to handle this.
import datetime
import pyspark.sql.functions as F
@F.udf('date')
def week_year_to_date(week_year):
    # %G = ISO year, %V = ISO week, %u = ISO weekday; the trailing '1' selects
    # Monday, the first day of the week
    return datetime.datetime.strptime(week_year + '1', '%G%V%u')
df = spark.createDataFrame([
    (1, "202001"),
    (2, "202002"),
    (3, "202003"),
    (4, "202052"),
    (5, "202053")
], ['id', 'week_year'])
df.withColumn("date", week_year_to_date('week_year')).show()
+---+---------+----------+
| id|week_year| date|
+---+---------+----------+
| 1| 202001|2019-12-30|
| 2| 202002|2020-01-06|
| 3| 202003|2020-01-13|
| 4| 202052|2020-12-21|
| 5| 202053|2020-12-28|
+---+---------+----------+
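For reference, the same parsing can be checked outside Spark - a minimal sketch, assuming Python 3.6+ where the ISO directives are available; the Mondays it prints match the table above:
import datetime

# %G = ISO year, %V = ISO week, %u = ISO weekday; appending '1' asks for Monday
for week_year in ["202001", "202052", "202053"]:
    print(week_year, datetime.datetime.strptime(week_year + '1', '%G%V%u').date())
# 202001 2019-12-30
# 202052 2020-12-21
# 202053 2020-12-28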
Based on mck's answer, here is the solution I ended up using for Python version 3.5.2:
import datetime
from dateutil.relativedelta import relativedelta
import pyspark.sql.functions as F
@F.udf('date')
def week_year_to_date(week_year):
    # the trailing '1' selects Monday; %W weeks start on the first Monday of the
    # year, so the result is shifted back one week to line up with the ISO weeks
    return datetime.datetime.strptime(week_year + '1', '%Y%W%w') - relativedelta(weeks=1)
df = spark.createDataFrame([
    (9, "201952"),
    (1, "202001"),
    (2, "202002"),
    (3, "202003"),
    (4, "202052"),
    (5, "202053")
], ['id', 'week_year'])
df.withColumn("date", week_year_to_date('week_year')).show()
Since I could not use '%G%V%u', which was only added in Python 3.6, I had to subtract one week from the date to get the correct result.
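As a spot check (run under Python 3.6+ purely for comparison), the workaround lands on the same Monday as the ISO directives for every week used in this question:
import datetime
from dateutil.relativedelta import relativedelta

# compare the %Y%W%w-minus-one-week workaround with the ISO directives %G%V%u
for week_year in ["201952", "202001", "202002", "202003", "202052", "202053"]:
    iso = datetime.datetime.strptime(week_year + '1', '%G%V%u')
    workaround = datetime.datetime.strptime(week_year + '1', '%Y%W%w') - relativedelta(weeks=1)
    print(week_year, iso.date(), workaround.date(), iso == workaround)
# every row prints True in the last column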
The following doesn't use a udf, but instead the more efficient vectorized pandas_udf:
import pandas as pd
import pyspark.sql.functions as F

@F.pandas_udf('date')
def week_year_to_date(week_year: pd.Series) -> pd.Series:
    # %G%V%u are the ISO year/week/weekday directives; each batch is parsed in one pandas call
    return pd.to_datetime(week_year + '1', format='%G%V%u')
df.withColumn('date', week_year_to_date('week_year')).show()
# +---+---------+----------+
# | id|week_year| date|
# +---+---------+----------+
# | 1| 202001|2019-12-30|
# | 2| 202002|2020-01-06|
# | 3| 202003|2020-01-13|
# | 4| 202052|2020-12-21|
# | 5| 202053|2020-12-28|
# +---+---------+----------+
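Roughly speaking, the pandas_udf receives the column as pandas Series in Arrow batches and parses each batch with a single pandas call, instead of invoking Python once per row the way a plain udf does, which is where the speed-up comes from.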