spark to_date function - how to convert 31-DEC-98 to 1998-12-31 not 2098-12-31
(Py)Spark to_date converts 31-DEC-98 to 2098-12-31. Is there a way to get 1998-12-31 instead?
The documentation offers no option for choosing the century (1900 vs 2000).
to_date(date_str[, fmt]) - Parses the date_str expression with the fmt expression to a date. Returns null with invalid input. By default, it follows casting rules to a date if the fmt is omitted.
grade_type = spark.read \
    .option("header", "true") \
    .option("nullValue", "") \
    .option("inferSchema", "true") \
    .csv("student/GRADE_TYPE_DATA_TABLE.csv")
grade_type.show(3)
-----
+---------------+-----------+----------+------------+-----------+-------------+
|GRADE_TYPE_CODE|DESCRIPTION|CREATED_BY|CREATED_DATE|MODIFIED_BY|MODIFIED_DATE|
+---------------+-----------+----------+------------+-----------+-------------+
| FI| Final| MCAFFREY| 31-DEC-98| MCAFFREY| 31-DEC-98|
| HM| Homework| MCAFFREY| 31-DEC-98| MCAFFREY| 31-DEC-98|
| MT| Midterm| MCAFFREY| 31-DEC-98| MCAFFREY| 31-DEC-98|
+---------------+-----------+----------+------------+-----------+-------------+
from pyspark.sql.functions import col, to_date

grade_type = spark.read \
    .option("header", "true") \
    .option("nullValue", "") \
    .option("inferSchema", "true") \
    .csv("student/GRADE_TYPE_DATA_TABLE.csv") \
    .withColumn("CREATED_DATE", to_date(col('CREATED_DATE'), "dd-MMM-yy")) \
    .withColumn("MODIFIED_DATE", to_date(col('MODIFIED_DATE'), "dd-MMM-yy"))
grade_type.show(3)
-----
+---------------+-----------+----------+------------+-----------+-------------+
|GRADE_TYPE_CODE|DESCRIPTION|CREATED_BY|CREATED_DATE|MODIFIED_BY|MODIFIED_DATE|
+---------------+-----------+----------+------------+-----------+-------------+
| FI| Final| MCAFFREY| 2098-12-31| MCAFFREY| 2098-12-31|
| HM| Homework| MCAFFREY| 2098-12-31| MCAFFREY| 2098-12-31|
| MT| Midterm| MCAFFREY| 2098-12-31| MCAFFREY| 2098-12-31|
+---------------+-----------+----------+------------+-----------+-------------+
Yes, but I think you have to do some ugly string manipulation:
df.withColumn("MODIFIED_DATE",
to_date(concat(col("MODIFIED_DATE").substr(0, 7),
lit("19"),
col("MODIFIED_DATE").substr(8, 2)
), "dd-MMM-yyyy"))
Got it working (note: this is Scala, but the API should be the same for PySpark):
scala> val df = Seq(("31-DEC-98")).toDF("MODIFIED_DATE")
scala> df.withColumn("new_date", to_date(concat(col("MODIFIED_DATE").substr(0, 7), lit("19"), col("MODIFIED_DATE").substr(8, 2)), "dd-MMM-yyyy")).show
+-------------+----------+
|MODIFIED_DATE| new_date|
+-------------+----------+
| 31-DEC-98|1998-12-31|
+-------------+----------+
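For reference, a minimal PySpark sketch of the same trick (assumptions: an active SparkSession named spark, and dates that all follow the dd-MMM-yy layout so the splice positions are fixed):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, lit, to_date

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("31-DEC-98",)], ["MODIFIED_DATE"])

# Rebuild "31-DEC-98" as "31-DEC-1998", then parse it with an
# unambiguous four-digit-year pattern.
df.withColumn("new_date",
              to_date(concat(col("MODIFIED_DATE").substr(1, 7),
                             lit("19"),
                             col("MODIFIED_DATE").substr(8, 2)),
                      "dd-MMM-yyyy")).show()

This should print the same table as the Scala demo above.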
In Spark 3.0 a new date parser was introduced, and the behavior for two-digit years changed.
You can find the change documented under Upgrading from Spark SQL 2.4 to 3.0.
spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')
will give you the original (pre-3.0) behavior, with the desired result:
from pyspark.sql import functions as F

spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')

(spark.createDataFrame([('31-DEC-98',)], 'my_date string')
    .select(F.to_date('my_date', 'dd-MMM-yy').alias('my_new_date'))
    .show())
+-----------+
|my_new_date|
+-----------+
| 1998-12-31|
+-----------+
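One caveat: spark.sql.legacy.timeParserPolicy is a session-wide setting, so flipping it to LEGACY changes how every subsequent date/timestamp pattern is parsed, not just this one column. A small sketch of reverting it once the legacy parse is done (unset restores the config's default, which in Spark 3.x raises an exception whenever the legacy and new parsers would disagree):

# Restore the default parser policy for the rest of the session.
spark.conf.unset('spark.sql.legacy.timeParserPolicy')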