Lex program won't recognize word when using OR statement

Question

I am running the lex program below, which recognizes the sentence about the cat just fine:

%{
        #include <iostream>
        #include <cstdio>
        #include <cstdlib>
        using namespace std;
        extern "C" int yylex();
%}

SP      [ ]+
ARTICLE "le" /* Line I am trying to change */
COMMUN "chat"
VERBE "est"
NOIR "noir"

PHRASE {ARTICLE}{SP}{COMMUN}{SP}{VERBE}{SP}{NOIR}


%%

^{PHRASE}\n     { cout << "Une phrase : " << yytext << '\n'; }

\n              { cout << '\n'; }

^.*\n           { cout << "Ligne inconnue : " << yytext << '\n'; }

%%

int main(int argc, char *argv[])
{
        ++argv, --argc;  
        if(argc > 0)
                yyin = fopen(argv[0], "r");
        else
                yyin = stdin;

        yylex();
} /* main() */

I get the following output:

Ligne inconnue : le professeur est Jean

Ligne inconnue : le professeur a un ordinateur

Ligne inconnue : Jean aime Linux

**Une phrase : le chat est noir**

Ligne inconnue : les etudiants ont des ordinateurs

However, when I try to add an OR to the program (for ARTICLE), the cat sentence is no longer recognized:

%{
        #include <iostream>
        #include <cstdio>
        #include <cstdlib>
        using namespace std;
        extern "C" int yylex();
%}

SP      [ ]+
ARTICLE "le"|"la" /* Line I am trying to change */
COMMUN "chat" 
VERBE "est"
NOIR "noir"

PHRASE {ARTICLE}{SP}{COMMUN}{SP}{VERBE}{SP}{NOIR}


%%

^{PHRASE}\n     { cout << "Une phrase : " << yytext << '\n'; }

\n              { cout << '\n'; }

^.*\n           { cout << "Ligne inconnue : " << yytext << '\n'; }

%%

int main(int argc, char *argv[])
{
        ++argv, --argc;  
        if(argc > 0)
                yyin = fopen(argv[0], "r");
        else
                yyin = stdin;

        yylex();
}

This gives me the following output:

Ligne inconnue : le professeur est Jean

Ligne inconnue : le professeur a un ordinateur

Ligne inconnue : Jean aime Linux

**Ligne inconnue : le chat est noir**

Ligne inconnue : les etudiants ont des ordinateurs

The input file is just a text file containing the following lines:

le professeur est Jean

le professeur a un ordinateur

Jean aime Linux

le chat est noir

les etudiants ont des ordinateurs

Can anyone tell me why this doesn't work? I have tried every variation of the OR syntax I could find online, but nothing has worked.

Thanks!

Answer 1

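The flex -l flag was implemented so that really old lex specifications, which would otherwise not work, could continue to be processed. For any newly written scanner, you really don't want that flag. This particular problem is a common reason why.

The issue is the handling of macro expansion: flex does the common-sense thing, which avoids a lot of common errors; lex (and flex -l), however, makes it much easier to shoot yourself in the foot with macro definitions.

In case it isn't clear, what lex calls a "definition" is actually a macro. And like C preprocessor macros, lex macros invite a number of potential misunderstandings.

I suppose almost every C programmer who has used the preprocessor has stumbled into this trap: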

#define SUM(a,b) a+b    // DON'T DO THIS, EVER

Although you might use this macro successfully in some contexts, you will eventually find that

int c = SUM(a,b) * 2;

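computes a+b*2 rather than the expected (a+b)*2. That's because macro substitution is purely textual replacement; if there are no parentheses in the macro, none will be generated.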
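To make the precedence problem concrete, here is a minimal sketch that puts the unparenthesized macro next to a parenthesized one. The names SUM_UNSAFE, SUM_SAFE, doubled_unsafe and doubled_safe are made up for this illustration.

```c
/* Unparenthesized: the body is pasted in as plain text, so the
   surrounding expression decides the precedence. */
#define SUM_UNSAFE(a, b) a + b

/* Parenthesized: each argument and the whole body are wrapped. */
#define SUM_SAFE(a, b) ((a) + (b))

/* Expands to: x + y * 2 */
static int doubled_unsafe(int x, int y) { return SUM_UNSAFE(x, y) * 2; }

/* Expands to: ((x) + (y)) * 2 */
static int doubled_safe(int x, int y) { return SUM_SAFE(x, y) * 2; }
```

With x = 3 and y = 4, doubled_unsafe returns 11 (3 + 4*2) while doubled_safe returns 14.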

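That is also how lex works, and how the Posix standard says it should work. But many years ago, the author of flex realized that no one really wants definitions like the following to work the way they do: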

ARTICLE "le"|"la"
%%
{ARTICLE}" chat"  { /* Matches either "le" or "la chat" */ }

So flex (normally) inserts the needed parentheses automatically, as if you had correctly defined ARTICLE as:

ARTICLE ("le"|"la")

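However, this is incompatible with the original lex, and it could break old lex programs that rely on the original, annoyingly literal semantics.

So flex provides the -l ("lex compatibility") option, which can be used to process those old lex programs. But, as I said, it should not be used for any new lex program.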
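Applied to the program in the question, the literal expansion turns the first rule into roughly `^"le"|"la"[ ]+"chat"...`, that is, "le" at the start of a line OR "la chat est noir". On the cat line that rule matches only two characters, so the longer catch-all rule wins. The sketch below only approximates this with POSIX regcomp/regexec; it does not run lex itself. The helper match_length and the two pattern constants are assumptions made for the illustration, with the lex patterns hand-translated into POSIX extended syntax.

```c
#include <regex.h>

/* Hand-translated approximations of the expanded first rule. */
static const char *LEX_STYLE  = "^le|la +chat +est +noir\n";   /* no parentheses added */
static const char *FLEX_STYLE = "^(le|la) +chat +est +noir\n"; /* parentheses added */

/* Returns the length of the match of `pattern` starting at the
   beginning of `text`, or -1 if there is no such match. */
static int match_length(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, &m, 0) == 0 && m.rm_so == 0)
        len = (int)m.rm_eo;
    regfree(&re);
    return len;
}
```

For the input line "le chat est noir\n" (17 characters), the lex-style pattern matches just the 2 characters of "le", whereas the flex-style pattern matches all 17. Since a lex scanner picks the longest match, the 17-character `^.*\n` rule beats the 2-character match, which is consistent with the "Ligne inconnue" output in the question.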

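In case the above is not convincing enough, that is not the only bad choice made by the original lex that the -l flag preserves. Another is the odd operator precedence of the counted-repetition operator {m,n}. In lex (and with flex -l),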

ab*   ab+   ab?   ab{0,3}

mean, respectively:

"an a followed by zero or more bs"
"an a followed by one or more bs"
"an a followed by an optional b"
"from zero to three repetitions of ab"

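Flex fixes this inconsistency by giving the brace repetition operator the same precedence as every other repetition operator, which is surely what everyone expects. Again, the -l flag reverts to the original lex behavior.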
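The two readings of ab{0,3} can be compared with a small sketch. It uses POSIX extended regular expressions, which bind {m,n} to the preceding single item much as flex does; the lex reading is emulated by adding explicit parentheses. The helper match_length and the constants FLEX_READING and LEX_READING are assumptions made for this illustration, not part of lex or flex.

```c
#include <regex.h>

static const char *FLEX_READING = "^ab{0,3}";   /* a, then zero to three b's */
static const char *LEX_READING  = "^(ab){0,3}"; /* zero to three copies of ab */

/* Returns the length of the match of `pattern` starting at the
   beginning of `text`, or -1 if there is no such match. */
static int match_length(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, &m, 0) == 0 && m.rm_so == 0)
        len = (int)m.rm_eo;
    regfree(&re);
    return len;
}
```

On the input "ababab" the flex reading matches only "ab" (length 2) and the lex reading matches all 6 characters; on "abbb" it is the other way around, 4 versus 2.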

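Finally, the -l option makes the default declaration of yytext an array rather than a pointer. While that can make a few things easier, on the whole it has some significant drawbacks, including:

It is a lot slower.
It prevents the scanner from resizing its buffer to cope with long tokens.

Bottom line: don't use the flex -l option (and, while we're on the subject, don't use the bison -y option either) unless you need it to compile legacy code.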
