在 Pyspark 中对日期使用正则表达式函数
Using regex function on date in Pyspark
我需要验证 Pyspark Dataframe 中的日期(字符串格式),并且我需要删除日期中的其他字符和符号(如果存在)。如何验证这样的?
我看到了这段代码
regex_string='\/](19|[2-9][0-9])\d\d$)|(^29[\/]02[\/](19|[2-9][0-9])(00|04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96)$)'
df.select(regexp_extract(col("date"),regex_string,0).alias("cleaned_map"),col('date')).show()
下面是我的输出
+-----------+-----------+
|cleaned_map| date|
+-----------+-----------+
| |01/06/w2020|
| |02/06/2!020|
| 02/06/2020| 02/06/2020|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 02/06/2020| 02/06/2020|
+-----------+-----------+
我的预期输出
+-----------+-----------+
|cleaned_map| date|
+-----------+-----------+
| 01/06/2020|01/06/w2020|
| 02/06/2020|02/06/20!20|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 06/06/2020| 06/06/2020|
| 07/06/2020| 07/06/2020|
+-----------+-----------+
尝试regexp_replace删除额外的字符符号。
df.show()
# +-----------+
# | date|
# +-----------+
# |01/06/w2020|
# |02/06/2!020|
# | 02/06/2020|
# +-----------+
df.withColumn("cleaned_map", F.regexp_replace("date", r'[^\d\/]','')).show()
# +-----------+-----------+
# | date|cleaned_map|
# +-----------+-----------+
# |01/06/w2020| 01/06/2020|
# |02/06/2!020| 02/06/2020|
# | 02/06/2020| 02/06/2020|
# +-----------+-----------+
试试这个-
val df = Seq("01/06/w2020",
"02/06/2!020",
"02/06/2020",
"03/06/2020",
"04/06/2020",
"05/06/2020",
"02/06/2020",
"//01/0/4/202/0").toDF("date")
df.withColumn("cleaned_map", regexp_replace($"date", "[^0-9T]", ""))
.withColumn("date_type", to_date($"cleaned_map", "ddMMyyyy"))
.show(false)
/**
* +--------------+-----------+----------+
* |date |cleaned_map|date_type |
* +--------------+-----------+----------+
* |01/06/w2020 |01062020 |2020-06-01|
* |02/06/2!020 |02062020 |2020-06-02|
* |02/06/2020 |02062020 |2020-06-02|
* |03/06/2020 |03062020 |2020-06-03|
* |04/06/2020 |04062020 |2020-06-04|
* |05/06/2020 |05062020 |2020-06-05|
* |02/06/2020 |02062020 |2020-06-02|
* |//01/0/4/202/0|01042020 |2020-04-01|
* +--------------+-----------+----------+
*/
如果您想排除任何要删除的字符,请丰富此模式"[^0-9/T]"
我需要验证 Pyspark Dataframe 中的日期(字符串格式),并且我需要删除日期中的其他字符和符号(如果存在)。如何验证这样的?
我看到了这段代码
regex_string='\/](19|[2-9][0-9])\d\d$)|(^29[\/]02[\/](19|[2-9][0-9])(00|04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96)$)'
df.select(regexp_extract(col("date"),regex_string,0).alias("cleaned_map"),col('date')).show()
下面是我的输出
+-----------+-----------+
|cleaned_map| date|
+-----------+-----------+
| |01/06/w2020|
| |02/06/2!020|
| 02/06/2020| 02/06/2020|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 02/06/2020| 02/06/2020|
+-----------+-----------+
我的预期输出
+-----------+-----------+
|cleaned_map| date|
+-----------+-----------+
| 01/06/2020|01/06/w2020|
| 02/06/2020|02/06/20!20|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 06/06/2020| 06/06/2020|
| 07/06/2020| 07/06/2020|
+-----------+-----------+
尝试regexp_replace删除额外的字符符号。
df.show()
# +-----------+
# | date|
# +-----------+
# |01/06/w2020|
# |02/06/2!020|
# | 02/06/2020|
# +-----------+
df.withColumn("cleaned_map", F.regexp_replace("date", r'[^\d\/]','')).show()
# +-----------+-----------+
# | date|cleaned_map|
# +-----------+-----------+
# |01/06/w2020| 01/06/2020|
# |02/06/2!020| 02/06/2020|
# | 02/06/2020| 02/06/2020|
# +-----------+-----------+
试试这个-
val df = Seq("01/06/w2020",
"02/06/2!020",
"02/06/2020",
"03/06/2020",
"04/06/2020",
"05/06/2020",
"02/06/2020",
"//01/0/4/202/0").toDF("date")
df.withColumn("cleaned_map", regexp_replace($"date", "[^0-9T]", ""))
.withColumn("date_type", to_date($"cleaned_map", "ddMMyyyy"))
.show(false)
/**
* +--------------+-----------+----------+
* |date |cleaned_map|date_type |
* +--------------+-----------+----------+
* |01/06/w2020 |01062020 |2020-06-01|
* |02/06/2!020 |02062020 |2020-06-02|
* |02/06/2020 |02062020 |2020-06-02|
* |03/06/2020 |03062020 |2020-06-03|
* |04/06/2020 |04062020 |2020-06-04|
* |05/06/2020 |05062020 |2020-06-05|
* |02/06/2020 |02062020 |2020-06-02|
* |//01/0/4/202/0|01042020 |2020-04-01|
* +--------------+-----------+----------+
*/
如果您想排除任何要删除的字符,请丰富此模式"[^0-9/T]"