Regex for Month-Day, Delimiter on Wrong Side

I have this code:

import re
x = "John Doe, Aug 5 2020 Hello Jane Doe: Aug 5 2020"
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
rx = re.compile(fr"\s+(?=(?:{'|'.join(months)})\b)", re.I)
print(rx.split(x))

It outputs this:

['John Doe,', 'Aug 5 2020 Hello Jane Doe:', 'Aug 5 2020']

I want it to output this:

["John Doe, Aug 5 2020", "Hello Jane Doe: Aug 5 2020"]

How can I do this? Thanks in advance for any help!

You can use findall instead of split, like this:

>>> rx = re.compile(fr"\b\S.*?(?:{'|'.join(months)})" + r"\s+\d{1,2}\s+\d{4}", re.I)
>>> print(rx.findall(x))
['John Doe, Aug 5 2020', 'Hello Jane Doe: Aug 5 2020']

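This regex starts matching at a word boundary followed by a non-whitespace character, then lazily matches anything up to the date string: an alternation of month names followed by the day and year parts.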
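For reuse, the same findall idea can be wrapped in a small function. This is only a sketch: the name `split_after_dates` is made up for illustration, and the extra `\b` before the month alternation and after the year are additions that are not in the pattern above (they keep a word such as "Smaug" from being read as "Aug").

```python
import re

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def split_after_dates(text, months=MONTHS):
    """Return the chunks of text that each end in a 'Mon D YYYY' date.

    Hypothetical helper, not part of the original answer.
    """
    # Escape the names in case a caller passes something with regex metacharacters.
    month_alt = '|'.join(map(re.escape, months))
    rx = re.compile(
        # Doubled braces produce literal {1,2} and {4} inside an f-string.
        rf"\b\S.*?\b(?:{month_alt})\s+\d{{1,2}}\s+\d{{4}}\b",
        re.I,
    )
    # No capturing groups, so findall returns the full matches.
    return rx.findall(text)

x = "John Doe, Aug 5 2020 Hello Jane Doe: Aug 5 2020"
print(split_after_dates(x))
```

Because the pattern needs at least one non-whitespace character before the month, a chunk that consists of nothing but a date would not be matched.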

RegEx Demo

Use

import re
x = "John Doe, Aug 5 2020 Hello Jane Doe: Aug 5 2020"
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
rx = re.compile(fr".*?\b(?:{'|'.join(months)})\s+\d+\s+\d+", re.I)
print([m.strip() for m in rx.findall(x)])

Result: ['John Doe, Aug 5 2020', 'Hello Jane Doe: Aug 5 2020']

See the Python proof.
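If you would rather stay with split, one possible variant is to put the date itself in a capturing group, so that re.split keeps it in the result, and then glue each date back onto the text that precedes it. This is a sketch under that assumption; `chunks_by_split` is a hypothetical name, and the `\d{1,2}` / `\d{4}` limits are borrowed from the other answer rather than from the code above.

```python
import re

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_alt = '|'.join(months)

# The date is captured, so split() returns it between the text pieces;
# the trailing \s* consumes the whitespace that follows each date.
date_rx = re.compile(rf"(\b(?:{month_alt})\s+\d{{1,2}}\s+\d{{4}}\b)\s*", re.I)

def chunks_by_split(text):
    # parts alternates: text, date, text, date, ..., trailing text
    parts = date_rx.split(text)
    chunks = [before + date for before, date in zip(parts[0::2], parts[1::2])]
    # Keep any text left over after the last date.
    if parts[-1].strip():
        chunks.append(parts[-1])
    return chunks

x = "John Doe, Aug 5 2020 Hello Jane Doe: Aug 5 2020"
print(chunks_by_split(x))
```

Unlike the findall versions, this keeps text that is not followed by a date as a final chunk.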

Regular expression:

.*?\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d+\s+\d+

Explanation

--------------------------------------------------------------------------------
  .*?                      any character except \n (0 or more times
                           (matching the least amount possible))
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    Jan                      'Jan'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Feb                      'Feb'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Mar                      'Mar'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Apr                      'Apr'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    May                      'May'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Jun                      'Jun'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Jul                      'Jul'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Aug                      'Aug'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Sep                      'Sep'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Oct                      'Oct'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Nov                      'Nov'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    Dec                      'Dec'
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
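
The breakdown above can also live next to the pattern itself by compiling it with re.VERBOSE (re.X), which ignores whitespace in the pattern and allows # comments. A minimal sketch, assuming the same months list and input string as in the code above; the variable names `verbose_rx` and `result` are illustrative only.

```python
import re

x = "John Doe, Aug 5 2020 Hello Jane Doe: Aug 5 2020"
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_alt = '|'.join(months)

verbose_rx = re.compile(
    rf"""
    .*?                 # any characters except newline, as few as possible
    \b                  # word boundary before the month name
    (?:{month_alt})     # one of the month abbreviations (non-capturing)
    \s+ \d+             # whitespace, then the day
    \s+ \d+             # whitespace, then the year
    """,
    re.I | re.X,
)

# The lazy .*? can begin on the space after the previous date, hence strip().
result = [m.strip() for m in verbose_rx.findall(x)]
print(result)
```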