re pattern to include year of dates

I have a question about a re pattern for matching dates, including their year.

Code

import re

text ="May 2020 Musical Portraits September 24 - 25, 2021 Time: 8:00 pm Toledo Museum of Art Peristyle  Romeo & JulietSpecial EventWhenFriday, Mar 23 / 20187:30pmBuy TicketsSunday, Mar 25 / 20182:30pmBuy TicketsWhereSamford University Wright CenterMap & DirectionsArtist"
format_list = ["(?:(?:(?:j|J)an)|(?:(?:f|F)eb)|(?:(?:m|M)ar)|(?:(?:a|A)pr)|(?:(?:m|M)ay)|(?:(?:j|J)un)|(?:(?:j|J)ul)|(?:(?:a|A)ug)|(?:(?:s|S)ep)|(?:(?:o|O)ct)|(?:(?:n|N)ov)|(?:(?:d|D)ec))\w*(?:\s)?(?:\n)?[0-9]{1,2}(?:\s)?(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}"]

all_dates=[]

for pattern in format_list:
    all_dates = re.findall(pattern, text)
    if all_dates == []:
        continue
    else:
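        for index, txt in enumerate(all_dates):
            # Replace non-ASCII characters, newlines, and tabs with a space.
            # Use a separate name so the source `text` is not overwritten.
            cleaned = re.sub('([^\x00-\x7F]+)|(\n)|(\t)', ' ', txt)
            all_dates[index] = cleaned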
    print(all_dates)

Output

['September 24 - 25, 2021', 'Mar 23 / 20187', 'Mar 25 / 20182']

Expected output

['September 24 - 25, 2021', 'Mar 23 / 2018', 'Mar 25 / 2018']

Problem

Instead of "…2018", I get "…20187" and "…20182".

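Just take the last (?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4} out of the pattern in your format_list, and the two "Mar" dates will match as expected. Be aware that the shortened pattern also matches "May 2020" and stops at "September 24 - 25" without the year, so it does not reproduce the expected output exactly. The shortened format_list: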

format_list = ["(?:(?:(?:j|J)an)|(?:(?:f|F)eb)|(?:(?:m|M)ar)|(?:(?:a|A)pr)|(?:(?:m|M)ay)|(?:(?:j|J)un)|(?:(?:j|J)ul)|(?:(?:a|A)ug)|(?:(?:s|S)ep)|(?:(?:o|O)ct)|(?:(?:n|N)ov)|(?:(?:d|D)ec))\w*(?:\s)?(?:\n)?[0-9]{1,2}(?:\s)?(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}"]
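To see the effect of dropping the trailing group, here is a small sketch. It uses a condensed, case-insensitive rewrite of the pattern (an assumption made for readability, not the exact original), and the names HEAD, TAIL, and sample are illustrative. The sample is a trimmed copy of the question's text.

```python
import re

# Trimmed copy of the question's text.
sample = ("May 2020 Musical Portraits September 24 - 25, 2021 "
          "Time: 8:00 pm Friday, Mar 23 / 20187:30pm")

MONTH = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w*"
# Month, day, optional separator, then a 2-4 digit number.
HEAD = MONTH + r"\s?[0-9]{1,2}\s?[,./-]?\s?[0-9]{2,4}"
# The trailing group that the answer removes.
TAIL = r"[,./-]?\s?[0-9]{2,4}"

with_tail = re.findall(HEAD + TAIL, sample, re.IGNORECASE)
without_tail = re.findall(HEAD, sample, re.IGNORECASE)

print(with_tail)     # ['September 24 - 25, 2021', 'Mar 23 / 20187']
print(without_tail)  # ['May 2020', 'September 24 - 25', 'Mar 23 / 2018']
```

With the tail present, the year group backtracks to "201" so that the trailing [0-9]{2,4} can still match "87", which is where the extra digit comes from.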

This pattern does what you need:

(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}\s+[\d\s\-\/,]+?\d{4}

Code:

import re

text ="May 2020 Musical Portraits September 24 - 25, 2021 Time: 8:00 pm Toledo Museum of Art Peristyle  Romeo & JulietSpecial EventWhenFriday, Mar 23 / 20187:30pmBuy TicketsSunday, Mar 25 / 20182:30pmBuy TicketsWhereSamford University Wright CenterMap & DirectionsArtist"
format_list  = [
    # r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}[\d\s\-\/,]*?\d{4}",  # If you want to also match e.g. May 2020
    r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}\s+[\d\s\-\/,]+?\d{4}",
]

for pattern in format_list:
    all_dates = re.findall(pattern, text, re.IGNORECASE)
    print(all_dates)

Output:

['September 24 - 25, 2021', 'Mar 23 / 2018', 'Mar 25 / 2018']

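Where:

  • (?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec) - matches the three-letter prefix of the month.
  • \w{0,6} - optionally matches the rest of the full month name; the longest case is "sep" (from the previous match) + "tember".
  • \s+ - matches 1 or more whitespace characters.
  • [\d\s\-\/,]+? - lazily matches the day part: digits separated by spaces, dashes, slashes, or commas.
  • \d{4} - matches the year part.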
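The lazy +? in the day part is what keeps the match from running past the year. A minimal sketch comparing the greedy and lazy forms on a fragment of the sample text (the variable names are illustrative):

```python
import re

fragment = "Friday, Mar 23 / 20187:30pmBuy Tickets"
months = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"

# Greedy: the character class grabs as much as it can, then backs off
# only until the last possible four digits fit \d{4}.
greedy = re.findall(months + r"\w{0,6}\s+[\d\s\-\/,]+\d{4}", fragment, re.IGNORECASE)

# Lazy: the character class grows one character at a time and stops
# at the first position where four digits follow.
lazy = re.findall(months + r"\w{0,6}\s+[\d\s\-\/,]+?\d{4}", fragment, re.IGNORECASE)

print(greedy)  # ['Mar 23 / 20187']
print(lazy)    # ['Mar 23 / 2018']
```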

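Note that since a regex is just string-based processing, you are limited to the "mon day, year" format here. You will need other patterns to match other possible date formats. You may want to explore date parsers that can scan text.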
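As a rough sketch of what "other patterns" could look like, the loop below puts a second, hypothetical day-first pattern (for text like "23 March 2018") next to the month-first one. The sample string and the names MONTH_FIRST, DAY_FIRST, and found are made up for illustration. Real-world text will likely need more variants, or a dedicated date parser.

```python
import re

MONTHS = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"

# Month-first pattern from the answer above.
MONTH_FIRST = MONTHS + r"\w{0,6}\s+[\d\s\-\/,]+?\d{4}"
# Hypothetical day-first variant, e.g. "23 March 2018".
DAY_FIRST = r"\d{1,2}\s+" + MONTHS + r"\w{0,6},?\s+\d{4}"

# Made-up sample text containing both formats.
text = "Doors open 23 March 2018; encore on Apr 2, 2018."

found = []
for pattern in (MONTH_FIRST, DAY_FIRST):
    # Collect matches per pattern; results are grouped by pattern,
    # not by their position in the text.
    found.extend(re.findall(pattern, text, re.IGNORECASE))

print(found)  # ['Apr 2, 2018', '23 March 2018']
```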