Python Regex Fails on 2 Edge Cases
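I am trying to write a regular expression that splits a string into what I call 'terms' (e.g., words, numbers, and the surrounding spaces) and 'logical operators' (e.g., <AND, &>, <OR, |>, <NOT, -, ~>, <(, {, [>, <), }, ]>). For this question, we can ignore the alternative symbols for AND, OR, and NOT, and grouping is done only with '(' and ')'.
For example: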
Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)
should be split into this Python list:
["Frank and Bob are nice", "AND", "NOT", "(", "Henry is good", "OR", "Sam is 102 years old", ")"]
My code:
pattern = r"(NOT|\-|\~)?\s*(\(|\[|\{)?\s*(NOT|\-|\~)?\s*([\w+\s*]*)\s+(AND|&|OR|\|)?\s+(NOT|\-|\~)?\s*([\w+\s*]*)\s*(\)|\]|\})?"
t = re.split(pattern, text)
raw_terms = list(filter(None, t))
The pattern works for the example above, for this test case, and for others:
NOT Frank is a good boy AND Joe
raw_terms=['NOT', 'Frank is a good boy', 'AND', 'Joe']
But not for these:
NOT Frank
raw_terms = ['NOT Frank']
NOT Frank is a good boy
raw_terms=['NOT Frank is a good boy']
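I tried changing the two \s+ to \s*, but then not all of the test cases passed. I am not a regex expert (this is the most complex one I have attempted).
I am hoping someone can help me understand why these two test cases fail, and how to fix the regex so that all of the test cases pass.
Thanks,
Mark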
Use
re.split(r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*', string)
Explanation
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
AND 'AND'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
OR 'OR'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
NOT 'NOT'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
[()] any character of: '(', ')'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
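The capturing group is what keeps the operators in the result: when the pattern passed to re.split contains a capturing group, the text matched by that group is returned as an item of the list. A minimal sketch of the difference, using a shortened sample string and variable names chosen only for illustration:

```python
import re

text = "Henry is good OR Sam is 102 years old"

# With a capturing group, the matched operator is returned as its own item.
with_capture = re.split(r"\s*(\b(?:AND|OR|NOT)\b)\s*", text)

# With only a non-capturing group, the operator is consumed and dropped.
without_capture = re.split(r"\s*\b(?:AND|OR|NOT)\b\s*", text)

print(with_capture)     # ['Henry is good', 'OR', 'Sam is 102 years old']
print(without_capture)  # ['Henry is good', 'Sam is 102 years old']
```

The \s* on either side of the group sits outside the capture, so the surrounding spaces are discarded instead of ending up in the terms.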
import re
string = 'Frank and Bob are nice AND NOT (Henry is good OR Sam is 102 years old)'
output = re.split(r'\s*(\b(?:AND|OR|NOT)\b|[()])\s*', string)
output = list(filter(None, output))
print(output)
Result: ['Frank and Bob are nice', 'AND', 'NOT', '(', 'Henry is good', 'OR', 'Sam is 102 years old', ')']
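As for why the two edge cases fail with the pattern from the question: its core is \s+(AND|&|OR|\|)?\s+, which needs either two consecutive whitespace characters or an operator with whitespace on both sides. Neither 'NOT Frank' nor 'NOT Frank is a good boy' contains that, so the pattern appears to find no match at all and re.split returns the input unsplit. The sketch below checks this and runs the simpler pattern on the same inputs. The names SPLIT_PATTERN, ORIGINAL_PATTERN, and split_terms are introduced here for illustration only.

```python
import re

# Pattern suggested in the answer above.
SPLIT_PATTERN = r"\s*(\b(?:AND|OR|NOT)\b|[()])\s*"

# Pattern from the question, copied unchanged (split over two lines for width).
ORIGINAL_PATTERN = (
    r"(NOT|\-|\~)?\s*(\(|\[|\{)?\s*(NOT|\-|\~)?\s*([\w+\s*]*)"
    r"\s+(AND|&|OR|\|)?\s+(NOT|\-|\~)?\s*([\w+\s*]*)\s*(\)|\]|\})?"
)


def split_terms(text):
    """Hypothetical helper: split text into terms and operators, dropping empty strings."""
    return [part for part in re.split(SPLIT_PATTERN, text) if part]


for case in ("NOT Frank", "NOT Frank is a good boy"):
    # The original pattern never matches these inputs, so nothing gets split.
    assert re.search(ORIGINAL_PATTERN, case) is None
    print(split_terms(case))
# ['NOT', 'Frank']
# ['NOT', 'Frank is a good boy']
```

The case that already worked still gives the same result: split_terms('NOT Frank is a good boy AND Joe') returns ['NOT', 'Frank is a good boy', 'AND', 'Joe'].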