Python parser combinator for capturing text inside delimiters
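I am looking at some parser combinator libraries in Python (Parsy, to be precise), and I am currently facing the following problem, simplified below into a minimal working example: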
text = '''
AAAAAAAAAA AAAAAAAA AAAAAAAAAAAAAA
BBBBBBB START THE TEXT HERE SHOULD
BE CAPTURED STOP CCCCCCCCCC CCCCCC
'''
start, stop = r"STARTS?", r"STOPS?"
s = section(text, start, stop)
print(s)
should output:
THE TEXT HERE SHOULD
BE CAPTURED
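My current solution uses a regex lookahead. It works fine, but my original problem involves combining many of these small regexes, which can get messy, and other people will have to maintain the result later.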
from typing import TypeVar
import re

# A generic type declaration.
T = TypeVar("T")

def first(text: str, pattern: str, default: T, flags=0) -> T:
    """
    Given a `text`, a regex `pattern` and a `default` value, return the first match
    in `text`. Otherwise return the `default` value if no match is found.
    """
    match = re.findall(pattern, text, flags=flags)
    return match[0] if len(match) > 0 else default

def section(text: str, begin: str, end: str) -> str:
    """
    Given a `text` and two regexes, `begin` and `end`, return the captured group
    found in the interval. Otherwise, return an empty string if no match is found.
    """
    # Lazy match after `begin`, stopping right before the first `end` (lookahead).
    return first(text, fr"{begin}([\s\S]*?)(?={end})", default="")
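For reference, the regex approach can be exercised on the sample text as in the sketch below. It is not the code from the question: it swaps `re.findall` for `re.search` (which stops at the first match) and strips the surrounding whitespace so that the result matches the expected output. The name `section_search` is hypothetical.

```python
import re


def section_search(text: str, begin: str, end: str) -> str:
    """Hypothetical variant of `section` built on re.search instead of re.findall."""
    # Lazy body, then a lookahead so the `end` delimiter itself is not consumed.
    match = re.search(rf"{begin}([\s\S]*?)(?={end})", text)
    # Strip the whitespace that separates the delimiters from the body.
    return match.group(1).strip() if match else ""


text = '''
AAAAAAAAAA AAAAAAAA AAAAAAAAAAAAAA
BBBBBBB START THE TEXT HERE SHOULD
BE CAPTURED STOP CCCCCCCCCC CCCCCC
'''

captured = section_search(text, r"STARTS?", r"STOPS?")
print(captured)
```

Without the `.strip()` call the captured group keeps the single space after START and the one before STOP.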
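Parser combinators seem like a good fit for this kind of case, but I could not reproduce the behavior of the working solution. Any hints are welcome: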
# A Simpler example with hardcoded stuff
from parsy import regex, seq, string
text = '''
AAAAAAAAAA AAAAAAAA AAAAAAAAAAAAAA
BBBBBBB START THE TEXT HERE SHOULD
BE CAPTURED STOP CCCCCCCCCC CCCCCC
'''
start = regex(r"STARTS?")
middle = regex(r"[\s\S]*").optional()
stop = regex(r"STOPS?")
eol = string("\n")
# These work fine
start.parse("START")
middle.parse("")
stop.parse("STOP")
section = seq(
    start,
    middle,
    stop,
)
# Simpler case, breaks
section.parse("START AAA STOP")
gives:
---------------------------------------------------------------------------
ParseError Traceback (most recent call last)
<ipython-input-260-fdec112e1648> in <module>
24 )
25 # Simpler case, breaks
---> 26 section.parse("START AAA STOP")
~/.venv/lib/python3.8/site-packages/parsy/__init__.py in parse(self, stream)
88 def parse(self, stream):
89 """Parse a string or list of tokens and return the result or raise a ParseError."""
---> 90 (result, _) = (self << eof).parse_partial(stream)
91 return result
92
~/.venv/lib/python3.8/site-packages/parsy/__init__.py in parse_partial(self, stream)
102 return (result.value, stream[result.index:])
103 else:
--> 104 raise ParseError(result.expected, stream, result.furthest)
105
106 def bind(self, bind_fn):
ParseError: expected 'STOPS?' at 0:14
Have you tried using split?
Based on my understanding of your project's requirements, I would do it this way:
text = '''
AAAAAAAAAA AAAAAAAA AAAAAAAAAAAAAA
BBBBBBB START THE TEXT HERE SHOULD
BE CAPTURED STOP CCCCCCCCCC CCCCCC
'''
# split text at START and take the second part of the text
# Then split the result by STOP and take the first part of the text
s = text.split('START')[1].split('STOP')[0]
print(s)
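Note that chained `split` calls raise an `IndexError` when START is missing and only handle literal delimiters, not the `STARTS?` / `STOPS?` regexes from the question. A slightly more defensive sketch of the same idea, assuming literal delimiters and using `str.partition`, might look like this (the function name `section_by_partition` is hypothetical):

```python
def section_by_partition(text: str, begin: str = "START", end: str = "STOP") -> str:
    """Return the text between literal `begin` and `end`, or "" if either is missing."""
    # partition returns (before, separator, after); separator is "" when not found.
    _, found_begin, after = text.partition(begin)
    if not found_begin:
        return ""
    body, found_end, _ = after.partition(end)
    return body.strip() if found_end else ""


text = '''
AAAAAAAAAA AAAAAAAA AAAAAAAAAAAAAA
BBBBBBB START THE TEXT HERE SHOULD
BE CAPTURED STOP CCCCCCCCCC CCCCCC
'''

s = section_by_partition(text)
print(s)
```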
The problem is that the middle parser matches text all the way to the end, so the stop parser has nothing left to consume:
seq(start, middle).parse("START AAA STOP")
prints
['START', ' AAA STOP']
One way to avoid this behavior is to use a lookahead in the middle regex:
middle = regex(r"[\s\S]*(?=STOP)").optional()
This ensures that the matched text is followed by the word "STOP".
Alternatively, you can use Parsy's should_fail method:
middle = (regex(r"STOPS?").should_fail("not STOP") >> any_char).many().concat()