Using a regex function on dates in PySpark

I need to validate dates stored as strings in a PySpark DataFrame, and I also need to strip any stray characters and symbols from them if present. How can I do this?

I came across this code:

regex_string='\/](19|[2-9][0-9])\d\d$)|(^29[\/]02[\/](19|[2-9][0-9])(00|04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96)$)'
df.select(regexp_extract(col("date"),regex_string,0).alias("cleaned_map"),col('date')).show()

This is my output:

+-----------+-----------+
|cleaned_map|       date|
+-----------+-----------+
|           |01/06/w2020|
|           |02/06/2!020|
| 02/06/2020| 02/06/2020|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 02/06/2020| 02/06/2020|
+-----------+-----------+

My expected output:

+-----------+-----------+
|cleaned_map|       date|
+-----------+-----------+
| 01/06/2020|01/06/w2020|
| 02/06/2020|02/06/20!20|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 06/06/2020| 06/06/2020|
| 07/06/2020| 07/06/2020|
+-----------+-----------+

Try regexp_replace to remove the extra characters and symbols.

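    from pyspark.sql import functions as F

    df.show()

    # +-----------+
    # |       date|
    # +-----------+
    # |01/06/w2020|
    # |02/06/2!020|
    # | 02/06/2020|
    # +-----------+

    # Remove every character that is neither a digit nor a slash
    df.withColumn("cleaned_map", F.regexp_replace("date", r'[^\d/]', '')).show()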

    # +-----------+-----------+
    # |       date|cleaned_map|
    # +-----------+-----------+
    # |01/06/w2020| 01/06/2020|
    # |02/06/2!020| 02/06/2020|
    # | 02/06/2020| 02/06/2020|
    # +-----------+-----------+
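
The question also asks about validation, which the replacement alone does not cover. Below is a minimal plain-Python sketch (standard library only, no Spark) of the same idea: strip with the character class used above, then check that what remains is a real calendar date. The helper names `clean_date` and `parse_or_none` and the sample `31/02/2020` are illustrative assumptions, not part of the original answer.

```python
import re
from datetime import datetime
from typing import Optional

# Same character class as the regexp_replace call above: keep digits and "/".
# re.ASCII restricts \d to 0-9, which is how Java regex (used by Spark) treats it by default.
NON_DATE_CHARS = re.compile(r"[^\d/]", re.ASCII)


def clean_date(raw: str) -> str:
    """Drop every character that is not a digit or a slash."""
    return NON_DATE_CHARS.sub("", raw)


def parse_or_none(cleaned: str) -> Optional[datetime]:
    """Return a datetime if the string is a real day/month/year date, else None."""
    try:
        return datetime.strptime(cleaned, "%d/%m/%Y")
    except ValueError:
        return None


# The first three samples come from the question; the last one is a made-up invalid date.
samples = ["01/06/w2020", "02/06/2!020", "02/06/2020", "31/02/2020"]
for raw in samples:
    cleaned = clean_date(raw)
    status = "valid" if parse_or_none(cleaned) is not None else "invalid"
    print(f"{raw} -> {cleaned} ({status})")
```

In Spark itself, the parsing step would presumably be done with `to_date` and a matching format string after the replacement, rather than with a Python function.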

Try this (Scala):

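    import org.apache.spark.sql.functions.{regexp_replace, to_date}
    import spark.implicits._ // assumes a SparkSession named `spark`; needed for toDF and $"..."

    val df = Seq("01/06/w2020",
    "02/06/2!020",
    "02/06/2020",
    "03/06/2020",
    "04/06/2020",
    "05/06/2020",
    "02/06/2020",
    "//01/0/4/202/0").toDF("date")
    // Strip everything except digits (and the letter T), then parse the digits as ddMMyyyy
    df.withColumn("cleaned_map", regexp_replace($"date", "[^0-9T]", ""))
      .withColumn("date_type", to_date($"cleaned_map", "ddMMyyyy"))
      .show(false)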

    /**
      * +--------------+-----------+----------+
      * |date          |cleaned_map|date_type |
      * +--------------+-----------+----------+
      * |01/06/w2020   |01062020   |2020-06-01|
      * |02/06/2!020   |02062020   |2020-06-02|
      * |02/06/2020    |02062020   |2020-06-02|
      * |03/06/2020    |03062020   |2020-06-03|
      * |04/06/2020    |04062020   |2020-06-04|
      * |05/06/2020    |05062020   |2020-06-05|
      * |02/06/2020    |02062020   |2020-06-02|
      * |//01/0/4/202/0|01042020   |2020-04-01|
      * +--------------+-----------+----------+
      */

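If you want to keep certain characters from being removed, add them to the character class. For example, "[^0-9/T]" keeps the slashes as well; in that case the to_date format would also need to contain the slashes (dd/MM/yyyy).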
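
To see how the two character classes differ, here is a small plain-Python sketch (standard library only, no Spark) that applies both to two of the sample strings. The helper name `strip_chars` is an illustrative assumption. Note that keeping the slashes leaves a string such as `//01/0/4/202/0` untouched, so it would still not parse as a date, whereas the digits-only version does.

```python
import re
from datetime import datetime


def strip_chars(raw: str, pattern: str) -> str:
    """Remove every character matched by the given regex pattern."""
    return re.sub(pattern, "", raw)


DIGITS_ONLY = r"[^0-9T]"    # removes slashes along with other stray characters
KEEP_SLASHES = r"[^0-9/T]"  # same class, but slashes survive

for raw in ["01/06/w2020", "//01/0/4/202/0"]:
    print(raw, "|", strip_chars(raw, DIGITS_ONLY), "|", strip_chars(raw, KEEP_SLASHES))

# A digits-only string can be parsed with a separator-free day/month/year format.
parsed = datetime.strptime(strip_chars("//01/0/4/202/0", DIGITS_ONLY), "%d%m%Y").date()
print(parsed)
```

Which class is preferable depends on whether misplaced separators should be tolerated (digits only) or preserved for a stricter downstream check (slashes kept).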