How to add Date validation in regexp of Spark-Sql
spark.sql("select case when length(regexp_replace(date,'[^0-9]', ''))==8 then regexp_replace(date,'[^0-9]', '') else regexp_replace(date,'[^0-9]','') end as date from input").show(false)
To the above I need to add requirements such as:
1. The output should be validated with unix_timestamp using the 'yyyymmdd' format.
- If that is invalid, the extracted numeric string should be transformed by moving its first four (4) characters to the end (MMDDYYYY to YYYYMMDD), then validated again with the 'yyyymmdd' format; if this condition is met, print that date.
I am not sure how to include unix_timestamp in the query.
Sample input and output 1:
input: 2021dgsth02hdg02
output: 20210202
Sample input and output 2:
input: 0101def20dr21 (note: MMDDYYYY TO YYYYMMDD)
output: 20210101
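The strip-validate-rotate rule above can be sketched in plain Scala (no Spark) with `java.time`; `normalizeDate` is a hypothetical helper name, used only to make the intended logic concrete:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Strip non-digits, accept the result if it is a valid yyyyMMdd date,
// otherwise move the first four characters to the end (MMddyyyy -> yyyyMMdd)
// and try again. Returns None when neither interpretation is a real date.
def normalizeDate(raw: String): Option[String] = {
  val digits = raw.replaceAll("[^0-9]", "")
  val fmt    = DateTimeFormatter.ofPattern("yyyyMMdd")
  def valid(s: String): Boolean = Try(LocalDate.parse(s, fmt)).isSuccess
  if (digits.length != 8) None
  else if (valid(digits)) Some(digits)
  else {
    val rotated = digits.substring(4) + digits.substring(0, 4)
    if (valid(rotated)) Some(rotated) else None
  }
}
```

With the sample inputs, `normalizeDate("2021dgsth02hdg02")` yields `Some("20210202")` and `normalizeDate("0101def20dr21")` yields `Some("20210101")`, because `01012021` fails to parse as yyyyMMdd (month 20) and is rotated.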
Use unix_timestamp in place of to_date:
spark.sql("""
select (
  case when length(regexp_replace(date, '[^0-9]', '')) == 8 then
    case
      when from_unixtime(unix_timestamp(regexp_replace(date, '[a-zA-Z]+', ''), 'yyyyMMdd'), 'yyyyMMdd') is null
      then from_unixtime(unix_timestamp(regexp_replace(date, '[a-zA-Z]+', ''), 'MMddyyyy'), 'yyyyMMdd')
      else from_unixtime(unix_timestamp(regexp_replace(date, '[a-zA-Z]+', ''), 'yyyyMMdd'), 'yyyyMMdd')
    end
  else regexp_replace(date, '[^0-9]', '')
  end
) as dt from input
""").show(false)
Try the code below.
scala> val df = Seq("2021dgsth02hdg02","0101def20dr21").toDF("dt")
df: org.apache.spark.sql.DataFrame = [dt: string]
scala> df.show(false)
+----------------+
|dt |
+----------------+
|2021dgsth02hdg02|
|0101def20dr21 |
+----------------+
scala> df
.withColumn("dt",regexp_replace($"dt","[a-zA-Z]+",""))
.withColumn("dt",
when(
to_date($"dt","yyyyMMdd").isNull,
to_date($"dt","MMddyyyy")
)
.otherwise(to_date($"dt","yyyyMMdd"))
).show(false)
+----------+
|dt |
+----------+
|2021-02-02|
|2021-01-01|
+----------+
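One gap worth noting: `to_date` produces a DateType column, so the result renders as `yyyy-MM-dd`, while the question expects the `yyyyMMdd` string. A sketch (not executed here) of the same fallback expressed with Spark's built-in `coalesce` and formatted back with `date_format`:

```scala
// Sketch, not executed here: same validate-then-rotate fallback, with the
// parsed date formatted back to the expected yyyyMMdd string.
// Depending on the Spark version and spark.sql.legacy.timeParserPolicy,
// an unparseable string may yield null or raise an error instead.
df
  .withColumn("dt", regexp_replace($"dt", "[^0-9]", ""))
  .withColumn("dt",
    date_format(
      coalesce(to_date($"dt", "yyyyMMdd"), to_date($"dt", "MMddyyyy")),
      "yyyyMMdd"))
  .show(false)
```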
// Entering paste mode (ctrl-D to finish)
spark.sql("""
select (
CASE WHEN to_date(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd') IS NULL
THEN to_date(regexp_replace(date,'[a-zA-Z]+',''),'MMddyyyy')
ELSE to_date(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd')
END
) AS dt from input
""")
.show(false)
// Exiting paste mode, now interpreting.
+----------+
|dt |
+----------+
|2021-02-02|
|2021-01-01|
+----------+