Python Regex Fails on 2 Edge Cases

Question

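I'm trying to write a regex that splits a string into what I call 'terms' (e.g. words, numbers, and surrounding spaces) and 'logical operators' (e.g. <AND, &>, <OR, |>, <NOT, -, ~>, <(, {, [, ), }, ]>). For this question, we can ignore the alternative symbols for AND, OR, and NOT, and grouping is done only with '(' and ')'.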

For example:

Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)

should be split into this Python list:

["Frank and Bob are nice", "AND", "NOT", "(", "Henry is good", "OR", "Sam is 102 years old", ")"]

My code:

pattern = r"(NOT|\-|\~)?\s*(\(|\[|\{)?\s*(NOT|\-|\~)?\s*([\w+\s*]*)\s+(AND|&|OR|\|)?\s+(NOT|\-|\~)?\s*([\w+\s*]*)\s*(\)|\]|\})?"  
t = re.split(pattern, text)
raw_terms = list(filter(None, t))

The pattern works for this test case, the one above, and others:

NOT Frank is a good boy AND Joe
raw_terms=['NOT', 'Frank is a good boy', 'AND', 'Joe']

But not for these:

NOT Frank
raw_terms = ['NOT Frank']
NOT Frank is a good boy
raw_terms=['NOT Frank is a good boy']

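I tried changing the two \s+ to \s*, but then not all of the test cases passed. I'm not a regex expert (this is the most complex one I've attempted).

I'm hoping someone can help me understand why these two test cases fail, and how to fix the regex so that all the test cases pass.

Thanks,

Mark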

Answer 1

Use

re.split(r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*', string)

See regex proof.

Explanation

--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      AND                      'AND'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      OR                       'OR'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      NOT                      'NOT'
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    [()]                     any character of: '(', ')'
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
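
The difference between the capturing outer group and the non-capturing inner group matters here because re.split includes the text matched by capturing groups in its result. A minimal sketch of that behavior, using a made-up sample string (not from the question):

```python
import re

sample = "a AND b"  # illustrative input, not from the question

# Capturing group around the operator: the operator is kept in the result.
with_capture = re.split(r"\s*(\b(?:AND|OR|NOT)\b)\s*", sample)
print(with_capture)  # ['a', 'AND', 'b']

# No capturing group: the operator is consumed by the split and dropped.
without_capture = re.split(r"\s*\b(?:AND|OR|NOT)\b\s*", sample)
print(without_capture)  # ['a', 'b']
```

The inner (?:...) only groups the alternatives so that both \b anchors apply to each keyword; making it capturing as well would add a duplicate entry for every operator.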

Python code:

import re
string = 'Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)'
output = re.split(r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*', string)
output = list(filter(None, output))
print(output)

Results: ['Frank and Bob are nice', 'AND', 'NOT', '(', 'Henry is good', 'OR', 'Sam is 102 years old', ')']
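
As for why the two edge cases fail, a likely reading is that the original pattern has a mandatory \s+, then an optional AND/OR operator, then another mandatory \s+. Without a binary operator in the input, that requires two consecutive whitespace characters, so the pattern never matches and re.split returns the input unchanged. The sketch below checks this against the question's inputs; the helper name split_terms is made up for this illustration.

```python
import re

# Pattern copied from the question.
ORIGINAL = (r"(NOT|\-|\~)?\s*(\(|\[|\{)?\s*(NOT|\-|\~)?\s*([\w+\s*]*)"
            r"\s+(AND|&|OR|\|)?\s+(NOT|\-|\~)?\s*([\w+\s*]*)\s*(\)|\]|\})?")
# Pattern from this answer.
FIXED = r"\s*(\b(?:AND|OR|NOT)\b|[()])\s*"


def split_terms(text):
    """Hypothetical helper: split on operators and drop empty strings."""
    return [part for part in re.split(FIXED, text) if part]


for text in ["NOT Frank", "NOT Frank is a good boy"]:
    # The original pattern finds no match at all in these inputs.
    print(re.search(ORIGINAL, text))  # None
    print(split_terms(text))

# ['NOT', 'Frank']
# ['NOT', 'Frank is a good boy']

print(split_terms("NOT Frank is a good boy AND Joe"))
# ['NOT', 'Frank is a good boy', 'AND', 'Joe']
```

Because of the \b anchors, a word such as NOTE that merely starts with a keyword is left intact, and since matching is case-sensitive, the lowercase "and" in "Frank and Bob" stays inside its term.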
