How can I use a recursive regex or another method to recursively validate this BBcode-like markup in Python?

Question

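I'm trying to write a program that validates documents written in a markup language similar to BBcode.

This markup language has both matching ([b]bold[/b] text) and non-matching (today is [date]) tags. Unfortunately, using a different markup language is not an option.

However, my regex is not behaving the way I want. It always seems to stop at the first matching closing tag instead of identifying nested tags with the recursive (?R).

I'm using the regex module, which supports (?R), rather than re.

My questions are:

How can I effectively use a recursive regex to match nested tags without terminating at the first closing tag?
If there is a better method than a regular expression, what is it?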

Here is the regex once I build it: \[(b|i|u|h1|h2|h3|large|small|list|table|grid)\](?:((?!\[\/\1\]).)*?|(?R))*\[\/\1\]

Here is a test string that doesn't work as expected: [large]test1 [large]test2[/large] test3[/large] (it should match the whole string but stops before test3)

Here is the regex on regex101.com: https://regex101.com/r/laJSLZ/1

This check doesn't need to finish in milliseconds or even seconds, but it does need to be able to run on a Travis-CI build.

Here is the logic that uses this regex, for context:

import io, regex # https://pypi.org/project/regex/

# All the tags that must have opening and closing tags
matching_tags = 'b', 'i', 'u', 'h1', 'h2', 'h3', 'large', 'small', 'list', 'table', 'grid'

# our first part matches an opening tag:
# \[(b|i|u|h1|h2|h3|large|small|list|table|grid)\]
# our middle part matches the text in the middle, including any properly formed tag sets in between:
# (?:((?!\[\/\1\]).)*?|(?R))*
# our last part matches the closing tag for our first match:
# \[\/\1\]
pattern = r'\[(' + '|'.join(matching_tags) + r')\](?:((?!\[\/\1\]).)*?|(?R))*\[\/\1\]'
myRegex = regex.compile(pattern)  # (?R) needs the regex module, not re

data = ''
with open('input.txt', 'r') as file:
    data = '[br]'.join(file.readlines())

def validate(text):
    valid = True
    for node in all_nodes(text):
        valid = valid and is_valid(node)
    return valid

# (Only important thing here is that I call this on every node, this
# should work fine but the regex to get me those nodes does not.)
# markup should be valid iff opening and closing tag counts are equal
# in the whole file, in each matching top-level pair of tags, and in
# each child all the way down to the smallest unit (a string that has
# no tags at all)
def is_valid(text):
    valid = True
    for tag in matching_tags:
        valid = valid and text.count(f'[{tag}]') == text.count(f'[/{tag}]')
    return valid

# this returns each child of the text given to it
# this call:
# all_nodes('[b]some [large]text to[/large] validate [i]with [u]regex[/u]![/i] love[/b] to use [b]regex to [i]do stuff[/i][/b]')
# should return a list containing these strings:
# [b]some [large]text to[/large] validate [i]with [u]regex[/u]![/i] love[/b]
# [large]text to[/large]
# [i]with [u]regex[/u]![/i]
# [u]regex[/u]
# [b]regex to [i]do stuff[/i][/b]
# [i]do stuff[/i]
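def all_nodes(text):
    result = []
    # finditer yields match objects; findall would return only the capture groups
    for m in myRegex.finditer(text):
        node = m.group(0)
        result.append(node)
        # recurse into the text between the outer opening and closing tags
        tag = m.group(1)
        result += all_nodes(node[len(tag) + 2:-(len(tag) + 3)])
    return result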

exit(0 if validate(data) else 1)

Answer 1

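Your main issue is within the ((?!\[\/\1\]).)*? tempered greedy token.

First, it is inefficient: you quantify the token and then quantify the whole group it sits in, which makes the regex engine look for more ways to match the string and makes the pattern rather fragile.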

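Second, you only match up to the closing tag and do not restrict the opening tag. The first step is to make the / before \1 optional: \/?. That makes the token stop before opening tags without attributes, such as [tag], but not before tags that carry attributes. To add attribute support, add an optional group after \1: (?:\s[^]]*)?. It matches an optional sequence of a whitespace character followed by any 0+ characters other than ].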
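To illustrate why the lookahead has to reject the opening tag as well, here is a minimal sketch that uses only the standard re module. The tag name large is hard-coded in place of \1, and the names sample, close_only, and open_or_close are illustrative assumptions, not part of the pattern in the question.

```python
import re

# Text that follows an outer [large] opening tag (illustrative sample).
sample = "test1 [large]test2[/large] test3[/large]"

# Token from the question: only the closing tag stops it, so the nested
# opening [large] is consumed as if it were ordinary text.
close_only = re.compile(r"(?:(?!\[/large\]).)*", re.DOTALL)

# Token from the fix: stops before an opening OR a closing tag
# (attributes allowed), which leaves nested tags to the recursion.
open_or_close = re.compile(r"(?:(?!\[/?large(?:\s[^]]*)?]).)*", re.DOTALL)

swallowed = close_only.match(sample).group(0)
stopped = open_or_close.match(sample).group(0)

print(repr(swallowed))
print(repr(stopped))
```

With the first token the nested opening tag ends up inside the consumed text, so the first [/large] that follows is taken as the end of the outer block. The second token stops right before the nested [large], which is where (?R) gets its chance.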

The fixed regex looks like

\[([biu]|h[123]|l(?:arge|ist)|small|table|grid)](?:(?!\[/?\1(?:\s[^]]*)?]).|(?R))*\[/\1]
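A short usage sketch of the fixed pattern, written with the \1 backreference to the opening tag name. It assumes the third-party regex module is installed (pip install regex); the names FIXED, sample, and multiline are illustrative.

```python
import regex  # third-party: https://pypi.org/project/regex/

# The fixed pattern, split over three string literals for readability.
FIXED = regex.compile(
    r'\[([biu]|h[123]|l(?:arge|ist)|small|table|grid)]'  # opening tag
    r'(?:(?!\[/?\1(?:\s[^]]*)?]).|(?R))*'                # text or a nested block
    r'\[/\1]',                                           # matching closing tag
    regex.DOTALL,                                        # let . cross line breaks
)

# Test string from the question: same-named tags nested inside each other.
sample = '[large]test1 [large]test2[/large] test3[/large]'
m = FIXED.search(sample)
print(m is not None and m.group(0) == sample)

# A block that spans more than one line.
multiline = '[b]line one\nline two[/b]'
m2 = FIXED.search(multiline)
print(m2 is not None and m2.group(0) == multiline)
```

Note that the lookahead only refers to the name captured by \1, so tags with a different name than the enclosing one are consumed as ordinary characters. The pattern therefore checks nesting per tag name; a count check such as is_valid from the question is still needed for the rest.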

Do not forget to compile it with regex.DOTALL so that . also matches across line breaks.
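On the second part of the question (a method other than a regular expression), one common alternative is a single pass over the tags with a stack. The following is a minimal sketch, not something taken from the pattern above. It uses the standard re module only as a tokenizer, and the names PAIRED_TAGS, TAG_RE, and find_errors are assumptions. It also assumes that tags may carry attributes after whitespace and that any tag outside the paired list (such as [date] or [br]) is standalone.

```python
import re

# Tags that must appear as matching open/close pairs (list from the question).
PAIRED_TAGS = {'b', 'i', 'u', 'h1', 'h2', 'h3', 'large', 'small',
               'list', 'table', 'grid'}

# Tokenizer: optional slash, a tag name, optional attributes after whitespace.
TAG_RE = re.compile(r'\[(/?)([A-Za-z0-9]+)(?:\s[^\]]*)?\]')

def find_errors(text):
    """Return a list of nesting problems; an empty list means no problem was found."""
    errors = []
    stack = []  # (tag name, offset) for every paired tag that is still open
    for m in TAG_RE.finditer(text):
        closing, name = m.group(1) == '/', m.group(2)
        if name not in PAIRED_TAGS:
            continue  # standalone tags such as [date] need no partner
        if not closing:
            stack.append((name, m.start()))
        elif stack and stack[-1][0] == name:
            stack.pop()  # closes the most recently opened tag
        else:
            errors.append(f'unexpected [/{name}] at offset {m.start()}')
    errors.extend(f'unclosed [{name}] at offset {pos}' for name, pos in stack)
    return errors

ok = find_errors('[large]test1 [large]test2[/large] test3[/large]')
bad = find_errors('[b]bold [i]both[/b][/i] today is [date]')
print(ok)
print(bad)
```

Compared with counting tags, the stack also catches tags that are closed in the wrong order, and the offsets make the failure easy to locate in a CI log. It runs in one pass over each file.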

python

regex

regex-lookarounds

regex-recursion

python-regex