Regex: match an equal number of two characters
I would like to use a regex to match the arguments of any function as a string. As an example, take the following string:
predicate(foo(x.bar, predicate(foo(...), bar)), bar)
This may be part of a longer sequence:
predicate(foo(x.bar, predicate(foo(...), bar)), bar)predicate(foo(x.bar, predicate(foo(...), bar)), bar)predicate(foo(x.bar, predicate(foo(...), bar)), bar)
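I now want to find all substrings that represent a function/predicate together with its arguments (i.e. in the first example, the whole string as well as the nested predicate(foo(...), bar)). The problem is that I cannot simply match like this: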
predicate\(.*, bar\)
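because if * is greedy, I might match more than the predicate's arguments, and if it is lazy, I might match less. This is because such predicates can be nested.
What I need is a regex that finds the string predicate(...), where ... lazily matches any string that contains an equal number of ( and ).
In case it matters: I am using regex with the re module in Python.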
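To make the greedy/lazy problem concrete, here is a minimal sketch using only the re module. The variable names (two_calls, greedy, lazy) are illustrative and not part of the question.

```python
import re

s = 'predicate(foo(x.bar, predicate(foo(...), bar)), bar)'
two_calls = s + s  # two adjacent top-level calls

# Greedy: .* runs to the LAST ", bar)" in the input,
# so the two adjacent calls are merged into a single match.
greedy = re.findall(r'predicate\(.*, bar\)', two_calls)

# Lazy: .*? stops at the FIRST ", bar)", which closes the
# nested call, so the outer match is cut short and unbalanced.
lazy = re.findall(r'predicate\(.*?, bar\)', s)

print(greedy == [two_calls])  # True: one match spanning both calls
print(lazy)                   # ['predicate(foo(x.bar, predicate(foo(...), bar)']
```

Neither quantifier can count parentheses, which is why the answers below either parse the string or use a recursive pattern.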
import re

def parse(s):
    pattern = re.compile(r'([^(),]+)|\s*([(),])\s*')
    stack = []
    state = 0  # 0 = before identifier, 1 = after identifier, 2 = after closing paren
    current = None
    args = []
    for match in pattern.finditer(s):
        if match.group(1):
            if state != 0:
                raise SyntaxError("Expected identifier at {0}".format(match.start()))
            current = match.group(1)
            state = 1
        elif match.group(2) == '(':
            if state != 1:
                raise SyntaxError("Unexpected open paren at {0}".format(match.start()))
            stack.append((args, current))
            state = 0
            current = None
            args = []
        elif match.group(2) == ',':
            if state != 0: args.append(current)
            state = 0
            current = None
        elif match.group(2) == ')':
            if state != 0: args.append(current)
            if len(stack) == 0:
                raise SyntaxError("Unmatched paren at {0}".format(match.start()))
            newargs = args
            args, current = stack.pop()
            current = (current, newargs)
            state = 2
    if state != 0: args.append(current)
    if len(stack) > 0:
        raise SyntaxError("Unclosed paren")
    return args
>>> from pprint import pprint
>>> pprint(parse('predicate(foo(x.bar, predicate(foo(...), bar)), bar)'), width=1)
[('predicate',
  [('foo',
    ['x.bar',
     ('predicate',
      [('foo',
        ['...']),
       'bar'])]),
   'bar'])]
It returns a list of all top-level expressions, separated by commas. Function calls become tuples of a name and a list of arguments.
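Since the question asks for substrings rather than a tree, one way to get there from the nested structure is to walk it and render each matching call back to text. This is only a sketch: render and find_calls are hypothetical helpers, and the tree literal is copied from the output shown above.

```python
def render(node):
    """Turn a parsed node (a string or a (name, args) tuple) back into text."""
    if isinstance(node, tuple):
        name, args = node
        return '{0}({1})'.format(name, ', '.join(render(a) for a in args))
    return node

def find_calls(nodes, name):
    """Yield the text of every call to `name`, outermost first."""
    for node in nodes:
        if isinstance(node, tuple):
            if node[0] == name:
                yield render(node)
            # Recurse into the arguments to pick up nested calls.
            for found in find_calls(node[1], name):
                yield found

tree = [('predicate',
         [('foo', ['x.bar', ('predicate', [('foo', ['...']), 'bar'])]),
          'bar'])]
calls = list(find_calls(tree, 'predicate'))
print(calls)
# ['predicate(foo(x.bar, predicate(foo(...), bar)), bar)', 'predicate(foo(...), bar)']
```

Note that rendering normalises the separator to ", ", so the result may differ from the input in whitespace.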
You can create a regex that finds all function calls in your code. Something like this:
([_a-zA-Z]+)(?=\()
Then, using the re module, build a data structure that indexes the function calls in your code.
import re

code = 'predicate(foo(x.bar, predicate(foo(...), bar)), bar)predicate(foo(x.bar, predicate(foo(...), bar)), bar)predicate(foo(x.bar, predicate(foo(...), bar)), bar)'
code_cp = code
regex = re.compile(r'([_a-zA-Z]+)(?=\()')
matches = re.findall(regex, code)
structured_matches = []
for m in matches:
    beg = str.index(code, m)
    end = beg + len(m)
    structured_matches.append((m, beg, end))
    code = code[:beg] + '_' * len(m) + code[end:]
This will give you a data structure that looks like this:
[
('predicate', 0, 9),
('foo', 10, 13),
('predicate', 21, 30),
('foo', 31, 34),
('predicate', 52, 61),
('foo', 62, 65),
('predicate', 73, 82),
('foo', 83, 86),
('predicate', 104, 113),
('foo', 114, 117),
('predicate', 125, 134),
('foo', 135, 138)
]
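As an aside, the same index can probably be built without overwriting matched names with underscores, because re match objects already carry their positions. A minimal sketch, where name_pattern is an illustrative name:

```python
import re

code = 'predicate(foo(x.bar, predicate(foo(...), bar)), bar)' * 3
name_pattern = re.compile(r'([_a-zA-Z]+)(?=\()')

# finditer yields match objects, so start/end offsets come for free.
structured_matches = [(m.group(1), m.start(1), m.end(1))
                      for m in name_pattern.finditer(code)]

print(structured_matches[:4])
# [('predicate', 0, 9), ('foo', 10, 13), ('predicate', 21, 30), ('foo', 31, 34)]
```

This leaves code unmodified, so a separate copy of the string is not needed for later slicing.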
You can use this data structure together with a parse function to extract the contents of the parentheses of each function call.
def parse(string):
    stack = []
    contents = ''
    opened = False
    for c in string:
        if len(stack) > 0:
            contents += c
        if c == '(':
            opened = True
            stack.append('o')
        elif c == ')':
            stack.pop()
            if opened and len(stack) == 0:
                break
    return contents[:-1]

paren_contents = []
for m in structured_matches:
    fn_name, beg, end = m
    paren_contents.append((fn_name, parse(code_cp[end:])))
In the end, paren_contents should look something like this:
[
('predicate', 'foo(x.bar, predicate(foo(...), bar)), bar'),
('foo', 'x.bar, predicate(foo(...), bar)'),
('predicate', 'foo(...), bar'), ('foo', '...'),
('predicate', 'foo(x.bar, predicate(foo(...), bar)), bar'),
('foo', 'x.bar, predicate(foo(...), bar)'),
('predicate', 'foo(...), bar'), ('foo', '...'),
('predicate', 'foo(x.bar, predicate(foo(...), bar)), bar'),
('foo', 'x.bar, predicate(foo(...), bar)'),
('predicate', 'foo(...), bar'),
('foo', '...')
]
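If the goal is the full predicate(...) substrings from the question rather than (name, contents) pairs, the two steps above can be folded into one function. This is a sketch under that assumption; call_substrings and its name parameter are hypothetical, not part of the code above.

```python
import re

def call_substrings(code, name=None):
    """Return every 'name(...)' substring whose parentheses balance.

    If `name` is None, calls to any function are returned.
    """
    pattern = re.compile(r'[_a-zA-Z]+(?=\()')
    results = []
    for m in pattern.finditer(code):
        if name is not None and m.group(0) != name:
            continue
        depth = 0
        # Scan forward from the opening paren, tracking nesting depth.
        for i in range(m.end(), len(code)):
            if code[i] == '(':
                depth += 1
            elif code[i] == ')':
                depth -= 1
                if depth == 0:
                    results.append(code[m.start():i + 1])
                    break
        # An unclosed call never reaches depth 0 and is skipped.
    return results

s = 'predicate(foo(x.bar, predicate(foo(...), bar)), bar)'
print(call_substrings(s, 'predicate'))
# ['predicate(foo(x.bar, predicate(foo(...), bar)), bar)', 'predicate(foo(...), bar)']
```

Using a depth counter instead of a list as the stack is a small simplification; the scan itself is the same idea.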
Hope this points you in the right direction.
If you add the PyPI package regex, as @Tim Pietzcker suggested, you can use recursive regexes.
>>> import regex
>>> s = 'predicate(foo(x.bar, predicate(foo(...), bar)), bar)'
>>> pattern = regex.compile(r'(\w+)(?=\(((?:\w+\((?2)\)|[^()])*)\))')
>>> pattern.findall(s)
[('predicate', 'foo(x.bar, predicate(foo(...), bar)), bar'),
('foo', 'x.bar, predicate(foo(...), bar)'),
('predicate', 'foo(...), bar'),
('foo', '...')]
You can also restrict it to look only for "predicate":
>>> pattern = regex.compile(r'(predicate)(?=\(((?:\w+\((?2)\)|[^()])*)\))')
>>> pattern.findall(s)
[('predicate', 'foo(x.bar, predicate(foo(...), bar)), bar'),
('predicate', 'foo(...), bar')]
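If installing regex is not an option, the standard re module can approximate this for a bounded nesting depth by expanding the pattern by hand. This is a different technique from true recursion, shown here only as a sketch; bounded_balanced and the depth limit of 5 are assumptions made for illustration.

```python
import re

def bounded_balanced(depth):
    """Build a pattern for text with balanced parens nested up to `depth` levels."""
    inner = r'[^()]*'
    for _ in range(depth):
        # Each round allows one more level of "( ... )" around the previous pattern.
        inner = r'(?:[^()]|\(' + inner + r'\))*'
    return inner

s = 'predicate(foo(x.bar, predicate(foo(...), bar)), bar)'
pattern = re.compile(r'(predicate)(?=\((' + bounded_balanced(5) + r')\))')
print(pattern.findall(s))
# [('predicate', 'foo(x.bar, predicate(foo(...), bar)), bar'), ('predicate', 'foo(...), bar')]
```

As in the recursive version, the lookahead keeps the arguments out of the consumed match, so nested calls are still found. Calls nested deeper than the chosen depth are silently missed.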