Regex for Month-Day, Delimiter on Wrong Side
I have this code:
import re
x = "John Doe, Aug 5 2020 Hello Jane Doe: Aug 5 2020"
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
rx = re.compile(fr"\s+(?=(?:{'|'.join(months)})\b)", re.I)
print(rx.split(x))
It outputs this:
['John Doe,', 'Aug 5 2020 Hello Jane Doe:', 'Aug 5 2020']
I want it to output this:
["John Doe, Aug 5 2020", "Hello Jane Doe: Aug 5 2020"]
How can I do this? Thanks in advance for all your help!
You can use findall instead of split, like this:
>>> rx = re.compile(fr"\b\S.*?(?:{'|'.join(months)})" + r"\s+\d{1,2}\s+\d{4}", re.I)
>>> print(rx.findall(x))
['John Doe, Aug 5 2020', 'Hello Jane Doe: Aug 5 2020']
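In this regex, we start matching at a word boundary followed by a non-whitespace character, then lazily match anything until we reach the date string: an alternation of month abbreviations followed by the day and year parts.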
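For reuse, the same idea can be wrapped in a small helper. This is only a sketch: the names MONTHS, DATE_RECORD and split_records are made up for illustration, and the extra \b before the month alternation and after the year is my own addition (an assumption about what you want), so that a month abbreviation inside a longer word such as "Decker" is less likely to be picked up.

```python
import re

# Hypothetical names, chosen for this sketch only.
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Start at a word boundary plus a non-whitespace character, lazily consume
# text, then require "<month> <1-2 digit day> <4 digit year>".
# The pattern is built by concatenation so that {1,2} and {4} do not need
# escaping inside an f-string.
DATE_RECORD = re.compile(
    r"\b\S.*?\b(?:" + "|".join(MONTHS) + r")\s+\d{1,2}\s+\d{4}\b",
    re.I,
)

def split_records(text):
    """Return every chunk of text that ends with a month-day-year date."""
    return DATE_RECORD.findall(text)

x = "John Doe, Aug 5 2020 Hello Jane Doe: Aug 5 2020"
print(split_records(x))
```

Because each match has to start with a non-whitespace character, the results need no strip() afterwards.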
Use
import re
x = "John Doe, Aug 5 2020 Hello Jane Doe: Aug 5 2020"
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
rx = re.compile(fr".*?\b(?:{'|'.join(months)})\s+\d+\s+\d+", re.I)
print([m.strip() for m in rx.findall(x)])
Result: ['John Doe, Aug 5 2020', 'Hello Jane Doe: Aug 5 2020']
See the Python proof.
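To see why the strip() call is there, it may help to look at the raw matches. A minimal sketch reusing the same pattern (the variable names raw and cleaned are only illustrative): since the pattern begins with .*?, the second match starts right where the first one ended, so it carries the separating space with it.

```python
import re

x = "John Doe, Aug 5 2020 Hello Jane Doe: Aug 5 2020"
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
rx = re.compile(fr".*?\b(?:{'|'.join(months)})\s+\d+\s+\d+", re.I)

raw = rx.findall(x)                  # matches exactly as the regex sees them
cleaned = [m.strip() for m in raw]   # leading/trailing whitespace removed

print(raw)      # the second item begins with a space
print(cleaned)
```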
Regular expression:
.*?\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d+\s+\d+
Explanation
--------------------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    Jan                      'Jan'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Feb                      'Feb'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Mar                      'Mar'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Apr                      'Apr'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    May                      'May'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Jun                      'Jun'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Jul                      'Jul'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Aug                      'Aug'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Sep                      'Sep'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Oct                      'Oct'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Nov                      'Nov'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Dec                      'Dec'
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
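If you want to keep this kind of breakdown next to the pattern in your own code, one option is the re.VERBOSE (re.X) flag, which ignores whitespace in the pattern and allows # comments. The following is a sketch of the same regex written that way; the layout and comments are mine, and the behavior should be the same as the one-line version.

```python
import re

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

rx = re.compile(
    rf"""
    .*?                       # as few characters as possible (newlines excluded)
    \b                        # word boundary before the month
    (?:{'|'.join(months)})    # one of the month abbreviations, not captured
    \s+ \d+                   # whitespace, then the day
    \s+ \d+                   # whitespace, then the year
    """,
    re.I | re.X,
)

x = "John Doe, Aug 5 2020 Hello Jane Doe: Aug 5 2020"
result = [m.strip() for m in rx.findall(x)]
print(result)
```

Note that in verbose mode a literal space in the pattern has to be written as \s, [ ] or an escaped space, which is not an issue here because the pattern already uses \s+.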