PLY

Question

I'm writing a simple parser with PLY. My comments can look like this:

# this is a single line comment \
with an escaped new line

My attempt was to use lexer states for this. I have

states = (
    ('COMMENT', 'exclusive'),
)
tokens = ('COMMENT',)

def t_begin_COMMENT(t):
    r'\#'
    t.lexer.begin('COMMENT')


def t_COMMENT_contents(t):
    r'.|\\\n'


t_COMMENT_ignore = r' '

def t_COMMENT_error(t):
    pass


def t_COMMENT_end(t):
    r'\n'
    t.lexer.begin('INITIAL')

When I run

lexer = lex.lex()
string = "# test \\\ns \n4"
lexer.input(string)
for tok in lexer:
    print(tok)

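it should print only 4 (I have another token, but it isn't relevant here), but I get s and 4, even though s is still part of the comment. How should I write the regular expression for the comment contents? Is this happening because COMMENT ends with \n?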

Answer 1

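Python regular expressions do not produce the longest match. Alternation (|) in a Python regex is ordered: with the pattern .|\\\n, the . alternative is tried first and matches the backslash (it matches any character except a newline), so \\\n is never tried. This is easier to see without the escape characters: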

>>> import re
>>> re.match(r'.|ab', 'ab')
<_sre.SRE_Match object; span=(0, 1), match='a'>
>>> re.match(r'ab|.', 'ab')
<_sre.SRE_Match object; span=(0, 2), match='ab'>
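
The same effect can be checked with the backslash-newline patterns themselves. This is a minimal sketch using only the standard re module; the variable names are made up for illustration.

```python
import re

# A backslash, then a real newline, then more text.
text = "\\\nrest"

# '.' is listed first, so it wins and matches only the backslash.
dot_first = re.match(r'.|\\\n', text)

# The escape alternative is listed first, so it consumes both characters.
escape_first = re.match(r'\\\n|.', text)

print(repr(dot_first.group()))     # '\\'
print(repr(escape_first.group()))  # '\\\n'
```

Putting the longer, more specific alternative first is what lets the escaped newline be recognized as a unit.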

It's not at all clear to me why you would go to all that trouble instead of using a single regular expression, which needs no lexer states:

def t_comment(t):
    r'\#(\\\n|.)*\n'
    pass

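(Note: I would prefer the regex r'\#(\\[\s\S]|.)*', which lets \ escape anything, including itself. The regex you are using does not let you put a backslash at the end of a comment line:

# This will continue, perhaps unexpectedly: \\
still a comment

Also, the trailing \n will be ignored anyway, so there is no obvious reason to include it in the pattern, and with it the pattern can fail to match when the comment sits at the very end of an input that is not terminated by a newline.)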
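
The difference between the two comment patterns can be checked outside PLY. The sketch below uses only the standard re module; the names NEWLINE_ONLY and ESCAPE_ANY and the sample inputs are assumptions made for illustration, not part of the original code.

```python
import re

# Hypothetical names for the two patterns discussed in this answer.
NEWLINE_ONLY = re.compile(r'\#(\\\n|.)*\n')  # backslash escapes only a newline
ESCAPE_ANY = re.compile(r'\#(\\[\s\S]|.)*')  # backslash escapes any character

# A comment line ending in an escaped backslash, followed by real input.
src = "# ends with a backslash \\\\\n4\n"

# The second backslash pairs with the newline, so the comment swallows "4".
print(repr(NEWLINE_ONLY.match(src).group()))

# The two backslashes pair with each other, so the comment stops at the newline.
print(repr(ESCAPE_ANY.match(src).group()))

# A comment at the very end of the input, with no trailing newline.
eof = "# last line"
print(NEWLINE_ONLY.match(eof))               # no match: the pattern requires '\n'
print(repr(ESCAPE_ANY.match(eof).group()))   # matches the whole comment
```

Inside a PLY rule function the second pattern would be used as the docstring in the same way as the first; that part is not exercised here.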

PLY - escaping new line in C-style comments