re pattern to include year of dates
I have a question about an re pattern that should capture the year of each date.
Code
import re

text = "May 2020 Musical Portraits September 24 - 25, 2021 Time: 8:00 pm Toledo Museum of Art Peristyle Romeo & JulietSpecial EventWhenFriday, Mar 23 / 20187:30pmBuy TicketsSunday, Mar 25 / 20182:30pmBuy TicketsWhereSamford University Wright CenterMap & DirectionsArtist"
format_list = [r"(?:(?:(?:j|J)an)|(?:(?:f|F)eb)|(?:(?:m|M)ar)|(?:(?:a|A)pr)|(?:(?:m|M)ay)|(?:(?:j|J)un)|(?:(?:j|J)ul)|(?:(?:a|A)ug)|(?:(?:s|S)ep)|(?:(?:o|O)ct)|(?:(?:n|N)ov)|(?:(?:d|D)ec))\w*(?:\s)?(?:\n)?[0-9]{1,2}(?:\s)?(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}"]
all_dates = []
for pattern in format_list:
    all_dates = re.findall(pattern, text)
    if all_dates == []:
        continue
    else:
        for index, txt in enumerate(all_dates):
            # Replace non-ASCII runs, newlines and tabs with a space.
            # Stored in `cleaned` so the input `text` is not overwritten.
            cleaned = re.sub(r'([^\x00-\x7F]+)|(\n)|(\t)', ' ', txt)
            all_dates[index] = cleaned
print(all_dates)
Output
['September 24 - 25, 2021', 'Mar 23 / 20187', 'Mar 25 / 20182']
Expected output
['September 24 - 25, 2021', 'Mar 23 / 2018', 'Mar 25 / 2018']
Problem
Instead of "…2018", I get "…20187" and "…20182".
Just remove the last (?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4} from your format_list, and the extra digit after the year should go away. Use the format_list below.
format_list = ["(?:(?:(?:j|J)an)|(?:(?:f|F)eb)|(?:(?:m|M)ar)|(?:(?:a|A)pr)|(?:(?:m|M)ay)|(?:(?:j|J)un)|(?:(?:j|J)ul)|(?:(?:a|A)ug)|(?:(?:s|S)ep)|(?:(?:o|O)ct)|(?:(?:n|N)ov)|(?:(?:d|D)ec))\w*(?:\s)?(?:\n)?[0-9]{1,2}(?:\s)?(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}"]
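A quick way to check this shortened pattern is to run it against the sample text from the question. The sketch below assumes that same text; short_pattern and short_matches are illustrative names only. By my reading, the stray digit after 2018 does disappear, but the pattern now also picks up "May 2020" and cuts the September range off after "25", so it does not reproduce the expected output exactly.

```python
import re

text = "May 2020 Musical Portraits September 24 - 25, 2021 Time: 8:00 pm Toledo Museum of Art Peristyle Romeo & JulietSpecial EventWhenFriday, Mar 23 / 20187:30pmBuy TicketsSunday, Mar 25 / 20182:30pmBuy TicketsWhereSamford University Wright CenterMap & DirectionsArtist"

# The question's pattern with the trailing optional separator and
# second digit group removed (illustrative variable name).
short_pattern = r"(?:(?:(?:j|J)an)|(?:(?:f|F)eb)|(?:(?:m|M)ar)|(?:(?:a|A)pr)|(?:(?:m|M)ay)|(?:(?:j|J)un)|(?:(?:j|J)ul)|(?:(?:a|A)ug)|(?:(?:s|S)ep)|(?:(?:o|O)ct)|(?:(?:n|N)ov)|(?:(?:d|D)ec))\w*(?:\s)?(?:\n)?[0-9]{1,2}(?:\s)?(?:\,|\.|\/|\-)?(?:\s)?[0-9]{2,4}"

# All groups are non-capturing, so findall returns the full matches.
short_matches = re.findall(short_pattern, text)
print(short_matches)
```

If the September range and the exclusion of "May 2020" both matter, a pattern that anchors on a four-digit year is likely the safer choice.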
This pattern should do what you need:
(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}\s+[\d\s\-\/,]+?\d{4}
Code:
import re

text = "May 2020 Musical Portraits September 24 - 25, 2021 Time: 8:00 pm Toledo Museum of Art Peristyle Romeo & JulietSpecial EventWhenFriday, Mar 23 / 20187:30pmBuy TicketsSunday, Mar 25 / 20182:30pmBuy TicketsWhereSamford University Wright CenterMap & DirectionsArtist"
format_list = [
    # r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}[\d\s\-\/,]*?\d{4}",  # If you want to also match e.g. May 2020
    r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}\s+[\d\s\-\/,]+?\d{4}",
]
for pattern in format_list:
    all_dates = re.findall(pattern, text, re.IGNORECASE)
    print(all_dates)
Output:
['September 24 - 25, 2021', 'Mar 23 / 2018', 'Mar 25 / 2018']
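If month-and-year dates such as "May 2020" should be captured as well, the commented-out variant in format_list can be tried. This is a minimal sketch assuming the same sample text; loose_pattern and loose_matches are illustrative names, not part of the original answer.

```python
import re

text = "May 2020 Musical Portraits September 24 - 25, 2021 Time: 8:00 pm Toledo Museum of Art Peristyle Romeo & JulietSpecial EventWhenFriday, Mar 23 / 20187:30pmBuy TicketsSunday, Mar 25 / 20182:30pmBuy TicketsWhereSamford University Wright CenterMap & DirectionsArtist"

# Variant without the mandatory \s+ and with a lazy "zero or more" day part,
# so a month followed directly by a year can match too.
loose_pattern = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\w{0,6}[\d\s\-\/,]*?\d{4}"

loose_matches = re.findall(loose_pattern, text, re.IGNORECASE)
print(loose_matches)
```

The trade-off is that the looser day part accepts more input, so it may need tightening on text with other digit runs after month-like words.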
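Where:
(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)
- matches the month prefix.
\w{0,6}
- optionally matches the rest of the full month name; the longest case is "sep" (from the previous group) + "tember".
\s+
- matches one or more whitespace characters.
[\d\s\-\/,]+?
- matches the day part: digits separated by whitespace, dashes, slashes, or commas.
\d{4}
- matches the year part.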
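The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace and allows comments inside the pattern. This is only a sketch of one way to lay it out; date_pattern and verbose_matches are illustrative names.

```python
import re

text = "May 2020 Musical Portraits September 24 - 25, 2021 Time: 8:00 pm Toledo Museum of Art Peristyle Romeo & JulietSpecial EventWhenFriday, Mar 23 / 20187:30pmBuy TicketsSunday, Mar 25 / 20182:30pmBuy TicketsWhereSamford University Wright CenterMap & DirectionsArtist"

date_pattern = re.compile(
    r"""
    (?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)  # month prefix
    \w{0,6}          # rest of the month name, at most six more characters
    \s+              # one or more whitespace characters
    [\d\s\-\/,]+?    # day part: digits, whitespace, dashes, slashes, commas
    \d{4}            # four-digit year
    """,
    re.IGNORECASE | re.VERBOSE,
)

verbose_matches = date_pattern.findall(text)
print(verbose_matches)
```

Because the day part is lazy and the year is fixed at four digits, the match stops right after the year instead of swallowing the first digit of the time that follows it.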
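Note that because a regular expression is plain string matching, this only covers the "mon day, year" format shown here. You will need additional patterns to match other possible date formats. You may want to look into a date parser that can scan free text.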