Spark 日期格式问题
Spark date format issue
我在 spark 日期格式中观察到奇怪的行为。实际上我需要将日期 yy
转换为 yyyy
。日期转换后它应该是 20yy in date
我试过如下,2040年之后就失败了。
import org.apache.spark.sql.functions._
val df= Seq(("06/03/35"),("07/24/40"), ("11/15/43"), ("12/15/12"), ("11/15/20"), ("12/12/22")).toDF("Date")
df.withColumn("newdate", from_unixtime(unix_timestamp($"Date", "mm/dd/yy"), "mm/dd/yyyy")).show
+--------+----------+
| Date| newdate|
+--------+----------+
| 06/3/35|06/03/2035|
|07/24/40|07/24/2040|
|11/15/43|11/15/1943| // Here year appended with 19
|12/15/12|12/15/2012|
|11/15/20|11/15/2020|
|12/12/22|12/12/2022|
+--------+----------+
为什么会出现这种情况,有没有我可以直接使用的日期实用函数,而无需将 20 附加到字符串日期
解析 2 位数年份字符串受 SimpleDateFormat
docs:
中记录的一些相关解释的约束
For parsing with the abbreviated year pattern ("y" or "yy"), SimpleDateFormat must interpret the abbreviated year relative to some century. It does this by adjusting dates to be within 80 years before and 20 years after the time the SimpleDateFormat instance is created. For example, using a pattern of "MM/dd/yy" and a SimpleDateFormat instance created on Jan 1, 1997, the string "01/11/12" would be interpreted as Jan 11, 2012 while the string "05/04/64" would be interpreted as May 4, 1964.
所以,2043
距现在已有 20 多年了,解析器使用 1943 作为记录。
这是一种使用 UDF 的方法,它在解析日期之前在 SimpleDateFormat
对象上显式调用 set2DigitYearStart
(我选择 1980 年作为示例):
def parseDate(date: String, pattern: String): Date = {
val format = new SimpleDateFormat(pattern);
val cal = Calendar.getInstance();
cal.set(Calendar.YEAR, 1980)
val beginning = cal.getTime();
format.set2DigitYearStart(beginning)
return new Date(format.parse(date).getTime);
}
然后:
val custom_to_date = udf(parseDate _);
df.withColumn("newdate", custom_to_date($"Date", lit("mm/dd/yy"))).show(false)
+--------+----------+
|Date |newdate |
+--------+----------+
|06/03/35|2035-01-03|
|07/24/40|2040-01-24|
|11/15/43|2043-01-15|
|12/15/12|2012-01-15|
|11/15/20|2020-01-15|
|12/12/22|2022-01-12|
+--------+----------+
了解您的数据,您就会知道为 set2DigitYearStart()
的参数选择哪个值
我在 spark 日期格式中观察到奇怪的行为。实际上我需要将日期 yy
转换为 yyyy
。日期转换后它应该是 20yy in date
我试过如下,2040年之后就失败了。
import org.apache.spark.sql.functions._
val df= Seq(("06/03/35"),("07/24/40"), ("11/15/43"), ("12/15/12"), ("11/15/20"), ("12/12/22")).toDF("Date")
df.withColumn("newdate", from_unixtime(unix_timestamp($"Date", "mm/dd/yy"), "mm/dd/yyyy")).show
+--------+----------+
| Date| newdate|
+--------+----------+
| 06/3/35|06/03/2035|
|07/24/40|07/24/2040|
|11/15/43|11/15/1943| // Here year appended with 19
|12/15/12|12/15/2012|
|11/15/20|11/15/2020|
|12/12/22|12/12/2022|
+--------+----------+
为什么会出现这种情况,有没有我可以直接使用的日期实用函数,而无需将 20 附加到字符串日期
解析 2 位数年份字符串受 SimpleDateFormat
docs:
For parsing with the abbreviated year pattern ("y" or "yy"), SimpleDateFormat must interpret the abbreviated year relative to some century. It does this by adjusting dates to be within 80 years before and 20 years after the time the SimpleDateFormat instance is created. For example, using a pattern of "MM/dd/yy" and a SimpleDateFormat instance created on Jan 1, 1997, the string "01/11/12" would be interpreted as Jan 11, 2012 while the string "05/04/64" would be interpreted as May 4, 1964.
所以,2043
距现在已有 20 多年了,解析器使用 1943 作为记录。
这是一种使用 UDF 的方法,它在解析日期之前在 SimpleDateFormat
对象上显式调用 set2DigitYearStart
(我选择 1980 年作为示例):
def parseDate(date: String, pattern: String): Date = {
val format = new SimpleDateFormat(pattern);
val cal = Calendar.getInstance();
cal.set(Calendar.YEAR, 1980)
val beginning = cal.getTime();
format.set2DigitYearStart(beginning)
return new Date(format.parse(date).getTime);
}
然后:
val custom_to_date = udf(parseDate _);
df.withColumn("newdate", custom_to_date($"Date", lit("mm/dd/yy"))).show(false)
+--------+----------+
|Date |newdate |
+--------+----------+
|06/03/35|2035-01-03|
|07/24/40|2040-01-24|
|11/15/43|2043-01-15|
|12/15/12|2012-01-15|
|11/15/20|2020-01-15|
|12/12/22|2022-01-12|
+--------+----------+
了解您的数据,您就会知道为 set2DigitYearStart()