在字符串中捕获不是令牌的所有内容

Question

上下文： 我正在处理布尔和算术表达式的混合，可能类似于以下示例：

b_1 /\ (0 <= x_1) /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))))

我想匹配并提取公式中包含的形状 A <= B 的任何约束，该约束必须始终为真。在上面的示例中，只有 0 <= x_1 会满足此条件。

当前目标： 我的想法是构建一个仅关注以下标记的输入公式的简单解析树：and (/\), or (\/), left bracket (() and right bracket ())。鉴于上述公式，我想生成以下 AST：

/\
|_ "b_1"
|_ /\
   |_ "0 <= x_1"
   |_ \/
      |_ "x_2 <= 2"
      |_ /\
         |_ "b_3"
         |_ "(/ 1 3) <= x_4"

然后，我可以简单地遍历 AST 并丢弃任何以 \/ 为根的子树。

我的尝试：

查看this documentation，我定义词法分析器的语法如下：

import ply.lex as lex

tokens = (
    "LPAREN",
    "RPAREN",
    "AND",
    "OR",
    "STRING",
)

t_AND    = r'\/\'
t_OR     = r'\\/'
t_LPAREN = r'\('
t_RPAREN = r'\)'

t_ignore = ' \t\n'

def t_error(t):
    print(t)
    print("Illegal character '{}'".format(t.value[0]))
    t.lexer.skip(1)

def t_STRING(t):
    r'^(?!\)|\(| |\t|\n|\\/|\/\)'
    t.value = t
    return t

data = "b_1 /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))"

lexer = lex.lex()

lexer.input(data)

while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok.type, tok.value, tok.lineno, tok.lexpos)

但是，我得到以下输出：

~$ python3 lex.py
LexToken(error,'b_1 /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))',1,0)
Illegal character 'b'
LexToken(error,'_1 /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))',1,1)
Illegal character '_'
LexToken(error,'1 /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))',1,2)
Illegal character '1'
AND /\ 1 4
LPAREN ( 1 7
LexToken(error,'x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))',1,8)
Illegal character 'x'
LexToken(error,'_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))',1,9)
Illegal character '_'
LexToken(error,'2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))',1,10)
Illegal character '2'
LexToken(error,'<= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))',1,12)
Illegal character '<'
LexToken(error,'= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))',1,13)
Illegal character '='
LexToken(error,'2 \/ (b_3 /\ ((/ 1 3) <= x_4))',1,15)
Illegal character '2'
OR \/ 1 17
LPAREN ( 1 20
LexToken(error,'b_3 /\ ((/ 1 3) <= x_4))',1,21)
Illegal character 'b'
LexToken(error,'_3 /\ ((/ 1 3) <= x_4))',1,22)
Illegal character '_'
LexToken(error,'3 /\ ((/ 1 3) <= x_4))',1,23)
Illegal character '3'
AND /\ 1 25
LPAREN ( 1 28
LPAREN ( 1 29
LexToken(error,'/ 1 3) <= x_4))',1,30)
Illegal character '/'
LexToken(error,'1 3) <= x_4))',1,32)
Illegal character '1'
LexToken(error,'3) <= x_4))',1,34)
Illegal character '3'
RPAREN ) 1 35
LexToken(error,'<= x_4))',1,37)
Illegal character '<'
LexToken(error,'= x_4))',1,38)
Illegal character '='
LexToken(error,'x_4))',1,40)
Illegal character 'x'
LexToken(error,'_4))',1,41)
Illegal character '_'
LexToken(error,'4))',1,42)
Illegal character '4'
RPAREN ) 1 43
RPAREN ) 1 44

t_STRING 标记没有被正确识别。

问题：如何为 t_STRING 设置 catch all 正则表达式以获得有效的分词器？

Answer 1

您的 T_STRING 正则表达式肯定不会满足您的要求。它的作用有点难以回答。

原则上，它只包含两个零长度断言：^，它只在字符串的开头为真（除非你提供 re.MULTILINE 标志，你不t), 和一个很长的负前瞻断言。

仅包含零长度断言的模式只能匹配空字符串，如果它完全匹配任何内容的话。但是不允许词法分析器模式匹配空字符串。词法分析器将输入分成一系列标记，以便输入中的每个字符都属于某个标记。每场比赛——它们都是比赛，而不是搜索——恰好从上一场比赛的结尾开始。所以如果一个模式可以匹配空字符串，词法分析器会在同一个地方尝试下一个匹配，并得到相同的结果，这将是一个无限循环。

一些词法分析器生成器通过使用内置的包罗万象的错误模式强制进行最少的一个字符匹配来解决这个问题，但是如果模式与空字符串匹配，Ply 只是拒绝生成词法分析器。然而 Ply 并没有抱怨这个词法分析器规范。唯一可能的解释是该模式无法匹配任何内容。

关键是 Ply 使用 re.VERBOSE 标志编译所有模式，这允许您用白色 space 分隔正则表达式中的项目，使正则表达式的可读性稍微好一些。正如 Python 文档所示：

Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>.

Whitespace 包括换行符甚至注释（以 # 字符开头），因此您可以将模式拆分为多行并插入关于每一部分的注释。

事实上，我们可以用您的模式做到这一点：

def t_STRING(t):
    r'''^         # Anchor this match at the beginning of the input
        (?!       # Don't match if the next characters match:
           \)   | # Close parenthesis
           \(   | # Open parenthesis
           \    | # !!! HERE IS THE PROBLEM
           \t   | # Tab character
           \n   | # Newline character
           \\/ | # \/ token
           \/\   # /\ token
        )
     '''
    t.value = t
    return t

因此，当我向您的模式添加 whitespace 和注释时，我不得不注意到原始模式试图匹配 space 字符作为 | | 的替代字符。但是由于模式被编译为 re.VERBOSE，space 字符被忽略，留下一个空的替代项，它匹配空字符串。该替代方案是否定先行断言的一部分，这意味着如果在该点匹配的字符串以空字符串开头，则断言将失败。当然，每个字符串 都以空字符串开头，因此否定先行断言总是失败，这解释了为什么 Ply 没有抱怨（以及为什么模式从不匹配任何东西）。

不管那个特定的故障，该模式都没有用，因为如前所述，词法分析器模式必须匹配某些字符，因此仅匹配空字符串的模式没有用。我们要做的是匹配任何字符，前提是否定前瞻（已更正，如下所示）允许。所以这意味着否定先行断言显示后跟 .，它将匹配下一个字符。

但您几乎肯定不想只匹配一个字符。大概你想匹配一串不匹配任何其他标记的字符。因此，这意味着将否定先行断言和以下 . 放入重复中。请记住，它必须是非空重复（+，而不是 *），因为模式不能有空匹配项。

最后，使用锚断言绝对没有意义，因为这会将模式限制为仅在输入的开头匹配，而这肯定不是您想要的。根本不清楚它在那里做什么。（我看到建议使用具有负面前瞻性的锚 search，我认为这通常是错误的，但讨论超出了这个问题的范围。）

在我们写模式之前，让我们再做一个调整：在一个 Python 正则表达式中，如果你可以用一个字符 class 替换一组备选方案，你应该这样做，因为它更有效率。即使只能替换部分替代方案也是如此。

因此产生以下结果：

def t_STRING(t):
    r'''(
         (?!            # Don't match if the next characters match:
            [() \t\n] |   # Parentheses or whitespace
            \\/      |   # \/ token
            \/\          # /\ token
         ) .            # If none of the above match, accept a character
        )+              # and repeat as many times as possible (at least once)
     '''
    return t

我删除了 t.value = t。 t是token对象，不是字符串，值应该是它匹配的字符串。如果您用循环引用覆盖该值，您将无法找出匹配的字符串。

这行得通，但与您预期的方式不尽相同。由于白色 space 字符被排除在 T_STRING 之外，您不会得到表示 (/ 1 3) <= x_4 的单个标记。相反，您会得到一系列令牌：

STRING b_1 1 0
AND /\ 1 4
LPAREN ( 1 7
STRING x_2 1 8
STRING <= 1 12
STRING 2 1 15
OR \/ 1 17
LPAREN ( 1 20
STRING b_3 1 21
AND /\ 1 25
LPAREN ( 1 28
LPAREN ( 1 29
STRING / 1 30
STRING 1 1 32
STRING 3 1 34
RPAREN ) 1 35
STRING <= 1 37
STRING x_4 1 40
RPAREN ) 1 43
RPAREN ) 1 44

但我认为这是合理的。词法分析器如何能够分辨出 (x_2 <= 2 和 (b_3 中的括号是括号标记，而 (/ 1 3) <= x_4 中的括号是 T_STRING 的一部分？需要在您的解析器中做出该决定。

事实上，我倾向于将输入完全标记化，即使您（还）不需要完全标记化。正如整个问答所示，尝试识别 "everything but..." 实际上可能比仅识别所有标记要复杂得多。试图让分词器找出哪些分词有用，哪些分词没用通常 比分词并通过解析器传递更难。

Answer 2

基于 @rici, pointing the problem with t_STRING, this is my final version of the example that introduces smaller changes to the one proposed by @rici 的出色回答。

代码

##############
# TOKENIZING #
##############

tokens = (
    "LPAREN",
    "RPAREN",
    "AND",
    "OR",
    "STRING",
)

def t_AND(t):
    r'[ ]*\/\[ ]*'
    t.value = "/\"
    return t

def t_OR(t):
    r'[ ]*\\/[ ]*'
    t.value = "\/"
    return t

def t_LPAREN(t):
    r'[ ]*\([ ]*'
    t.value = "("
    return t

def t_RPAREN(t):
    r'[ ]*\)[ ]*'
    t.value = ")"
    return t

def t_STRING(t):
    r'''(
         (?!            # Don't match if the next characters match:
            [()\t\n] |   # Parentheses or whitespace
            \\/     |   # \/ token
           \/\          # /\ token
         ) .            # If none of the above match, accept a character
        )+              # and repeat as many times as possible (at least once)
     '''
    return t

def t_error(t):
    print("error: " + str(t.value[0]))
    t.lexer.skip(1)

import ply.lex as lex
lexer = lex.lex()

data = "b_b /\ (ccc <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))"
lexer.input(data)

while True:
    tok = lexer.token()
    if not tok:
        break
    print("{0}: `{1}`".format(tok.type, tok.value))

输出

STRING: `b_b `
AND: `/\`
LPAREN: `(`
STRING: `ccc <= 2 `
OR: `\/`
LPAREN: `(`
STRING: `b_3 `
AND: `/\`
LPAREN: `(`
LPAREN: `(`
STRING: `/ 1 3`
RPAREN: `)`
STRING: `<= x_4`
RPAREN: `)`
RPAREN: `)`

在字符串中捕获不是令牌的所有内容

capture in a string everything that is not a token

regex

parsing

lex

ply

python-3.x