使用 lex 进行词法分析

Question

在我展示我在这里所做的之前是我尝试做的作业（我是新手所以我不确定我是否做对了）。

1.  Implement lexical analyzer (using FLEX), as follows:
- Lexical analyzer supplies services next_token(), back_token()
- Lexical analyzer reads text from the input file and identifies tokens. This 
   happens when function   next_token() is called. 
- When a token is identified in the input text, it should be stored in a data 
   structure. For each token, the following attributes are saved:
   * token type
   * token lexeme
   * number of the line in the input text in which this token was found.
- Blanks, tabs, new lines – are not tokens, and should be ignored
- For each token, print (on a separate line) its type (e.g. rel_op , number , etc.) 
   and lexeme 
- Each operation, keyword, separation sign and each type of number should be 
   implemented as a token of a different kind
- Kinds of tokens are coded with integer numbers, for example:
        # define  ID_tok  1
        # define COMMA_tok  2

使用 Flex 我写了这个：

%{
#include<stdio.h>
int line=1;
# define  ID_tok  1
# define COMMA_tok  2
#define REL_OP_tok 3
#define NUMBER_tok 4
#define KEYWORDS_tok 5
%}
binary_ar_op "*"|"/"|"+"|"-"
rel_op "=="|"!="|">"|"<"|"<="|">="
id      [a-z][a-z0-9]*
number    ["+"|"-"]?[0-9]*("."[0-9]+)?
keywords "prabegin"|"parend"|"task"|"begin"|"end"|"integer"|"real"|"do"|"until"|"od"|"send"|"accept"

%%
\n   {line++;  printf("\n%d:",line);}

{binary_ar_op}+ {printf( "An binary_ar_op: %s (%d) at line(%d)\n", yytext,
                   atoi( yytext ),line);}

{rel_op}+ {printf( "An rel_op: %s (%d) at line(%d)\n", yytext,
                   atoi( yytext ),line);}

{id}+ {printf( "An id: %s (%d) at line(%d)\n", yytext,
                   atoi( yytext ),line);}


{number}+ {printf( "An number: %s (%d) at line(%d)\n", yytext,
                   atoi( yytext ),line);}
%%
int yywrap()
{
return 1;
}
main()
{
printf("Enter a string of data\n");
yylex();
}

如你所见，我已经定义了我所支持的所有类型，我不明白如何实现 next 和 back，一些指导会很好，我也保存了行号，但我想他们的意思是其他然后我写的，我也不明白他们写的定义部分。

我知道这个问题看起来很奇怪，但是我之前没有任何指导或解释就得到了这个作业，所以我正在自己学习它，我并不是很了解（虽然我知道理论，谢谢！ ).

Answer 1

我在我们公司的项目中做了一些非常相似的事情。

关于代币

我为他们做了枚举...

关于next_token()

我的意图是将整个令牌相关信息存储到一个对象中：

实际令牌（枚举值）
词素 (a std::string)
文件位置（由文件名指针、行和列组成）。

另外，我想对这些生成的对象使用智能指针，更何况，它们应该是C++对象。

这是我意识到的：

重新定义yylex()函数很容易。因此，您甚至可以重命名它并更改其签名。
很难（如果不是不可能）将其与 yacc/bison 放在一起。主要问题是数据使用 C 联合（%union 如果我没记错的话）从 lex（生成的代码）传递到 yacc/bison（生成的代码）。 C 联合和 C++ 对象——效果不佳。（C 联合中的一个对象可能有效，但多个对象绝对不行。）

幸运的是，第二个问题对我来说实际上并不存在，因为我使用 flex 但编写（同时生成）递归下降解析器（直接在 C++ 中）。

那么，如何解决第一个问题呢？这是来自我的代码：

/// redefines proto for generated yylex function.
#define YY_DECL \
  RF::YAMS::Sim::ActionScript::RefToken \
  RF::YAMS::Sim::ActionScript::Compiler::lex(yyscan_t yyscanner)

这是 flex man page 您可以找到文档的地方。要查找如何重新定义 yylex 函数的说明，请在本网站上搜索 "YY_DECL".

我的解析器在需要新令牌时调用 lex()。

备注：

在我的例子中，我重命名了 yylex()，甚至使它成为我的解析器 class 的一个方法。（我这样做是为了简化词法分析器对私有解析器变量的访问。）
我提供了完整的作用域运算符，因为生成的 lex 代码不考虑我在 C++ 代码中使用的任何命名空间。
yyscan_t yyscanner 参数必须存在，因为我生成可重入扫描器。你必须决定它是否应该在那里。（您也可以提供其他参数。）
返回的RefToken是指向生成的token的智能指针。（智能指针使得在不同 "places" 中生成和使用令牌变得非常容易，而没有内存泄漏的危险。）

如果要将生成的词法分析器与 bison 生成的解析器结合起来，则可能不会那么容易。在这种情况下，我会使用静态变量并组织队列来将值从词法分析器传递到解析器。这会起作用，但当然不如上述方法优雅和保存。

关于back_token()

一旦你有了一个使用令牌的解析器，你就可以随心所欲地使用它们。就我而言，其中一项要求是回溯选项。因此，有时我不得不将令牌推回输入。为此，我只是将它们堆叠在解析器中。如果需要新令牌，我首先检查此堆栈是否为空。如果不是弹出最上面的标记，则调用 lex() 方法以获得真正的新标记。我想在您的情况下可以使用类似的解决方案来实施 back_token()。

关于空白

在我的词法分析器中实际上有两种类型的规则（即规则动作）：

以 return new Token(...);
以 break;

我使用后者来处理分隔符（即空格等）甚至注释（解析器甚至看不到它们）。这是可行的，因为词法分析器实际上只是包裹在 for() 循环中的 switch()。（我从 flex 文档中学到了这个 "trick"。在某处明确提到了它。）

还有什么...

除了YY_DECL，我还重新定义了YY_INPUT。我这样做是为了将词法分析器与 C++ std::stream（而不是 yyin）结合使用。

恕我直言，flex 确实提供了非常全面的手册。但是，每当我有疑问时，我都会查看 flex 生成的 C 文件。有限自动机有这些可怕的 int 数组，我通常会忽略它们。其余部分是它们周围的基础结构，您会发现嵌入某处的 C 操作（写在 lex 规则中）。检查周围的代码可能会使事情更清楚。

使用 lex 进行词法分析

Lexical analysis using lex

c

lex

lexical-analysis

lexical

flex-lexer