Flex. Detect characters after preprocessor directives
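I am trying to develop a lexical analyzer that detects preprocessor directives and "code to analyze".
I want the analyzer to detect the preprocessor directives together with identifiers, integer constants, and so on (but only when those elements are on the same line as a directive), and also "code to analyze" (any line that is not a directive line).
For example, for the following code in a txt file,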
#define B 0
#ifdef C
#if D > ( 0 + 1 )
main(){
printf(“Hello”);
}
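I want to detect the following elements:
- Directives: #define, #ifdef, #if
- Identifiers: B, C, D
- Integer constants: 0, 1
- Symbols: ( , )
- Relational operators: >
- Arithmetic operators: +
- Code to analyze: main(){, printf(“Hello”); , }
Here is the code in which I implemented the analyzer: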
%{
/*Libraries Declaration */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/*Functions Headers */
/*Global variables */
%}
/** Regular Expressions Definitions */
TAB [ \t]+
DIG [0-9]
RESERV_WORD #define|#elif|#else|#endif|#if|#ifdef|#ifndef|#undef
DIR [^#]
OP_RELA {DIR}">"|">="|"<"|"<="|"=="|"!="
OP_ARIT {DIR}"+"|"-"|"*"|"/"|"%"
SYMBOL {DIR}"("|")"
INT_CTE {DIR}{DIG}+
SYMBOLYC_CTE {DIR}("\"")(.*)("\"")
IDENTIFIER {DIR}[A-Z]{1,8}
CODE_TO_ANALY ^[^#].*
/* Traduction rules*/
%option noyywrap
%%
{TAB} { }
{CODE_TO_ANALY} {
printf("[%s] is code to analyze\n",yytext);
}
{OP_RELA} { //Detect relational operators
printf("[%s] is relational operator\n",yytext);
}
{OP_ARIT} {
printf("[%s] is arith operator \n",yytext);
}
{RESERV_WORD} { //Detect reserved words
printf("[%s] is a reserved word\n",yytext);
}
{INT_CTE} { //Detect integer constants
printf("[%s] is an integer constant\n",yytext);
}
{SYMBOL} { //Detecta special symbols
printf("[%s] is a special symbol \n",yytext);
}
{SYMBOLYC_CTE} { //Detecta symbolic constants
printf("[%s] is a symbolic constant\n",yytext);
}
{IDENTIFIER} { //Detect identifiers
printf("[%s] is an identifier\n",yytext);
}
. {}
%%
int main(int argc, char *argv[])
{
if(argc>1){
//User entered a valid file name
yyin=fopen(argv[1],"r");
yylex();
printf("******************************************************************\n");
}
else{
//User didnt enter a valid file name
printf("\n");
exit(0);
}
return 0;
}
The analyzer works well on input in which every token is separated by a space.
Input txt file:
#define B 0
#ifdef B
#if B > ( 0 + 1 > 5 )
main(){
printf(“Hola programa”)
}
Console output:
[#define] is a reserved word
[ B] is an identifier
[ 0] is an integer constant
[#ifdef] is a reserved word
[ B] is an identifier
[#if] is a reserved word
[ B] is an identifier
[ >] is relational operator
[ (] is a special symbol
[ 0] is an integer constant
[ +] is arith operator
[ 1] is an integer constant
[ >] is relational operator
[ 5] is an integer constant
[)] is a special symbol
[main(){] is code to analyze
[printf(“Hola programa”)] is code to analyze
[}] is code to analyze
However, it does not work correctly on an input file that has no spaces between tokens.
Input txt file:
#define B 0
#ifdef B
#if B>(0+1)
main(){
printf(“Hola programa”)
}
Console output:
[#define] is a reserved word
[ B] is an identifier
[ 0] is an integer constant
[#ifdef] is a reserved word
[ B] is an identifier
[#if] is a reserved word
[ B] is an identifier
[>(] is a special symbol
[0+] is arith operator
[)] is a special symbol
[main(){] is code to analyze
[printf(“Hola programa”)] is code to analyze
[}] is code to analyze
Here's an interesting fact. When you trace the tokens that are produced, what you see is (heavily edited):
[ (] is a special symbol
[)] is a special symbol
Why does the ( have a space in front of it, but not the )? And could that have something to do with the incorrect token:
[>(] is a special symbol
With that hint, let's look again at the definition of SYMBOL. There is a rule:
{SYMBOL} { printf("[%s] is a special symbol \n",yytext); }
which depends on the macro definition
SYMBOL {DIR}"("|")"
which in turn refers to the macro DIR:
DIR [^#]
In other words, after macro processing the result is roughly:
[^#]"("|")" { printf("[%s] is a special symbol \n",yytext); }
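That rule applies to one of two possibilities:
- Any character other than #, followed by a (
- A )
That pattern certainly matches the two characters " (" (a space followed by an open parenthesis) as well as the single character ")". You probably also have a rule that discards whitespace, but because of the longest-match rule it does not apply in the case of " (". So that does explain why the open parenthesis is printed with a leading space.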
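To make the two alternatives concrete, here is a small C sketch of what the expanded pattern [^#]"("|")" accepts at the start of a string. The function name match_symbol is invented for this illustration and is not part of flex; the sketch only mimics this one pattern by hand.

```c
#include <stddef.h>

/* Hypothetical helper: length of the text that the expanded pattern
 *   [^#]"("|")"
 * would match at the start of s, or 0 if neither alternative matches. */
size_t match_symbol(const char *s)
{
    size_t best = 0;

    /* Alternative 1: any character except '#', followed by '('.
     * The NUL check only keeps us inside the C string. */
    if (s[0] != '\0' && s[0] != '#' && s[1] == '(')
        best = 2;

    /* Alternative 2: a single ')'. It is only used when the longer
     * alternative did not match, mirroring the longest-match rule. */
    if (best == 0 && s[0] == ')')
        best = 1;

    return best;
}
```

Under these assumptions, match_symbol(" (") and match_symbol(">(") both return 2, match_symbol(")") returns 1, and a lone "(" is not matched at all.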
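It also explains what happens in the lexical analysis of #if B>(0+1). First, #if is recognized. Then the rule [^#][A-Z]{1,8} matches, because [^#] matches a space. The next character is >, which does not match [^#]">"|">="|"<"|"<="|"=="|"!=" because in that pattern a > is only matched when it follows some character other than #. On the other hand, > is not a #, so at that position the input does match [^#]"("|")". (Compare what happens with the input #if B>=(0+1).)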
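The trace can be reproduced with a hand-written C sketch of the longest-match choice. It is only an approximation of the scanner: the helper names (m_rela, longest_match, and so on) are invented here, and the TAB, RESERV_WORD, SYMBOLYC_CTE and CODE_TO_ANALY rules are left out. Candidates are listed in the same order as the rules in the question, so an earlier rule wins a tie.

```c
#include <stddef.h>
#include <string.h>

typedef struct { const char *rule; size_t len; } Match;

/* DIR is [^#]. The NUL check only keeps the sketch inside the string. */
static int dir(char c)   { return c != '\0' && c != '#'; }
static int digit(char c) { return c >= '0' && c <= '9'; }
static int upper(char c) { return c >= 'A' && c <= 'Z'; }

/* [^#]">"|">="|"<"|"<="|"=="|"!=" */
static size_t m_rela(const char *s)
{
    static const char *const lits[] = { ">=", "<", "<=", "==", "!=" };
    size_t best = (dir(s[0]) && s[1] == '>') ? 2 : 0;
    for (size_t i = 0; i < sizeof lits / sizeof lits[0]; i++) {
        size_t n = strlen(lits[i]);
        if (strncmp(s, lits[i], n) == 0 && n > best)
            best = n;
    }
    return best;
}

/* [^#]"+"|"-"|"*"|"/"|"%" */
static size_t m_arit(const char *s)
{
    if (dir(s[0]) && s[1] == '+')
        return 2;
    return (s[0] != '\0' && strchr("-*/%", s[0]) != NULL) ? 1 : 0;
}

/* [^#][0-9]+ */
static size_t m_int(const char *s)
{
    size_t n = 0;
    if (dir(s[0]) && digit(s[1]))
        for (n = 2; digit(s[n]); n++)
            ;
    return n;
}

/* [^#]"("|")" */
static size_t m_symbol(const char *s)
{
    if (dir(s[0]) && s[1] == '(')
        return 2;
    return s[0] == ')' ? 1 : 0;
}

/* [^#][A-Z]{1,8}: one non-# character plus one to eight letters */
static size_t m_ident(const char *s)
{
    size_t n = 0;
    if (dir(s[0]) && upper(s[1]))
        for (n = 2; n < 9 && upper(s[n]); n++)
            ;
    return n;
}

/* Pick the longest candidate; the strict > keeps the earlier rule on ties. */
Match longest_match(const char *s)
{
    const Match cand[] = {
        { "relational operator", m_rela(s) },
        { "arithmetic operator", m_arit(s) },
        { "integer constant",    m_int(s) },
        { "special symbol",      m_symbol(s) },
        { "identifier",          m_ident(s) },
        { "ignored by the . rule", (s[0] != '\0' && s[0] != '\n') ? 1 : 0 },
    };
    Match best = { "no match", 0 };
    for (size_t i = 0; i < sizeof cand / sizeof cand[0]; i++)
        if (cand[i].len > best.len)
            best = cand[i];
    return best;
}
```

With these hand-coded rules, longest_match(">(0+1)") picks the special-symbol rule with length 2 and longest_match("0+1)") picks the arithmetic rule with length 2, which is consistent with the [>(] and [0+] lines in the output above. The remaining 1 is only matched by the catch-all . rule, so it never shows up as an integer constant.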
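So that explains what is happening. But do those rules make sense?
I suspect you believed that the {DIR} expansion would make the rest of the rule apply only to lines that do not start with #. Nothing in (f)lex regular expression syntax suggests that interpretation, and I don't know of any regular expression syntax that works that way.
(F)lex does have a mechanism for using different rules in different lexical contexts (start conditions), which is probably what you want here. But that mechanism can only be invoked in rules, not in macro definitions.
It's worth reading the manual section on start conditions for a full description; here is a partial solution based on it: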
/* The various contexts for parsing preprocessor directives. A full
 * solution would have more of these.
 */
%x CPP CPP_IF CPP_IFDEF CPP_REST
%%
  /* Anything which is not a preprocessor command */
[[:blank:]]*[^#\n[:blank:]].*   { printf("%s is code to analyse.\n", yytext); }
  /* cpp directives */
[[:blank:]]*#[[:blank:]]*       { BEGIN(CPP); }
  /* Anything else is a blank line, or the newline that ends a line
   * handled above. Ignore it. A pattern such as .*\n would be the
   * longest match on every line and hide the two rules above.
   */
[[:blank:]]*\n                  { /* Ignore */ }
  /* The first thing in a preprocessor line is normally the command.
   * In a full solution, there would be different contexts for each
   * command type; this is just a partial solution.
   */
<CPP>{
  (el)?if                   { printf("#%s directive\n", yytext); BEGIN(CPP_IF); }
  ifn?def                   { printf("#%s directive\n", yytext); BEGIN(CPP_IFDEF); }
  else|endif                { printf("#%s directive\n", yytext); BEGIN(CPP_REST); }
    /* Other directives need to be added. */
    /* Fallbacks */
  [[:alpha:]][[:alnum:]]*   { printf("#%s directive\n", yytext); BEGIN(CPP_REST); }
  .                         { puts("Unknown # directive"); BEGIN(CPP_REST); }
  \n                        { BEGIN(INITIAL); }
}
  /* Context to just skip everything to the end of the pp directive,
   * including backslash-newline continuations.
   */
<CPP_REST>(.|\\\n)*         { BEGIN(INITIAL); }
  /* Constant expression context, for #if and #elif */
<CPP_IF>{
  [[:digit:]]+              { printf("[%s] is an integer constant\n", yytext); }
  [[:alpha:]_][[:alnum:]_]* { printf("[%s] is an identifier\n", yytext); }
  [[:blank:]]*              ;
  [-+*/%!~|&]|"||"|"&&"     { printf("[%s] is an arithmetic operator\n", yytext); }
  [=<>!]=?                  { printf("[%s] is a relational operator\n", yytext); }
  [()]                      { printf("[%s] is a parenthesis\n", yytext); }
  .                         { printf("[%s] is unrecognized\n", yytext); }
  \n                        { BEGIN(INITIAL); }
}
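For readers who want to check the per-line decision outside of flex, the three INITIAL-state patterns can be sketched in plain C. The enum and the function name classify_line are invented for this illustration; the sketch assumes it is given one line at a time and only mirrors the decision about which context a line belongs to, not the tokenization that follows.

```c
#include <stddef.h>

enum line_kind { LINE_BLANK, LINE_DIRECTIVE, LINE_CODE };

/* Hypothetical helper: skip leading blanks, then decide from the first
 * remaining character, as the INITIAL-state patterns do. */
enum line_kind classify_line(const char *line)
{
    size_t i = 0;

    while (line[i] == ' ' || line[i] == '\t')   /* [[:blank:]]* */
        i++;

    if (line[i] == '#')
        return LINE_DIRECTIVE;   /* the scanner would BEGIN(CPP) here */
    if (line[i] == '\0' || line[i] == '\n')
        return LINE_BLANK;       /* nothing but blanks on this line */
    return LINE_CODE;            /* "code to analyse" */
}
```

Only after this decision does it make sense to look for identifiers, constants and operators, which is what the separate CPP_IF context is for.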