Flex. Detect characters after preprocessor directives

Question

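I am trying to develop a lexical analyzer that detects preprocessor directives and "code to analyze".

I want the analyzer to detect preprocessor directives together with identifiers, integer constants, etc. (but only when those elements are on the same line as a directive), and "code to analyze" (any line that is not a directive line).

For example, for the following code in a txt file,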

#define B 0
#ifdef C
#if D > ( 0 + 1 )
main(){
printf(“Hello”);
}

I want to detect the following elements:

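Directives: #define, #ifdef, #if
Identifiers: B, C, D
Integer constants: 0, 1
Symbols: ( , )
Relational operators: >
Arithmetic operators: +
Code to analyze: main(){, printf(“Hello”); , }

This is my code for the analyzer: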

%{
    /*Libraries Declaration */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /*Functions Headers */

    /*Global variables */

%}

/** Regular Expressions Definitions */

TAB [ \t]+
DIG [0-9]
RESERV_WORD #define|#elif|#else|#endif|#if|#ifdef|#ifndef|#undef

DIR [^#]
OP_RELA {DIR}">"|">="|"<"|"<="|"=="|"!="
OP_ARIT {DIR}"+"|"-"|"*"|"/"|"%"
SYMBOL  {DIR}"("|")"
INT_CTE {DIR}{DIG}+
SYMBOLYC_CTE {DIR}("\"")(.*)("\"")
IDENTIFIER {DIR}[A-Z]{1,8}
CODE_TO_ANALY ^[^#].*
/* Traduction rules*/
%option noyywrap
%%
{TAB}    { }
{CODE_TO_ANALY} {
  printf("[%s] is code to analyze\n",yytext);

}

{OP_RELA}       {           //Detect relational operators
            printf("[%s] is relational operator\n",yytext);
        }

{OP_ARIT}   {
            printf("[%s] is arith operator \n",yytext);
        }

{RESERV_WORD}       {       //Detect reserved words
            printf("[%s] is a reserved word\n",yytext);
        }

{INT_CTE}       {               //Detect integer constants
            printf("[%s] is an integer constant\n",yytext);
        }

{SYMBOL}    { //Detecta special symbols
    printf("[%s] is a special symbol \n",yytext);
}

{SYMBOLYC_CTE}  { //Detecta symbolic constants
            printf("[%s] is a symbolic constant\n",yytext);
        }

{IDENTIFIER}    { //Detect identifiers
            printf("[%s] is an identifier\n",yytext);
}



. {}


%%

int main(int argc, char *argv[])
{
    if(argc>1){
        //User entered a valid file name

        yyin=fopen(argv[1],"r");
        yylex();

        printf("******************************************************************\n");
    }
    else{
        //User didnt enter a valid file name

        printf("\n");
        exit(0);
    }

    return 0;
}

The analyzer works well on code in which the tokens are separated by spaces.

Input txt file

#define B 0
#ifdef B
#if B > ( 0 + 1 > 5 )
main(){
printf(“Hola programa”)
        }

Console output

    [#define] is a reserved word
    [ B] is an identifier
    [ 0] is an integer constant
    [#ifdef] is a reserved word
    [ B] is an identifier
    [#if] is a reserved word
    [ B] is an identifier
    [ >] is relational operator
    [ (] is a special symbol 
    [ 0] is an integer constant
    [ +] is arith operator 
    [ 1] is an integer constant
    [ >] is relational operator
    [ 5] is an integer constant
    [)] is a special symbol 
    [main(){] is code to analyze
    [printf(“Hola programa”)] is code to analyze
    [}] is code to analyze

However, it does not work correctly on an input file with no spaces between the tokens.

Input txt file:

#define B 0
#ifdef B
#if B>(0+1)
main(){
printf(“Hola programa”)
}

Console output:

[#define] is a reserved word
[ B] is an identifier
[ 0] is an integer constant
[#ifdef] is a reserved word
[ B] is an identifier
[#if] is a reserved word
[ B] is an identifier
[>(] is a special symbol 
[0+] is arith operator 
[)] is a special symbol 
[main(){] is code to analyze
[printf(“Hola programa”)] is code to analyze
[}] is code to analyze

Answer 1

Here's an interesting fact. When you trace the tokens produced, what you see is (heavily edited):

[ (] is a special symbol 
[)] is a special symbol

Why is ( preceded by a space, while ) is not? Could that have something to do with the incorrect token:

[>(] is a special symbol

With that hint, let's look at the definition of SYMBOL. There is a rule:

{SYMBOL}    { printf("[%s] is a special symbol \n",yytext); }

which depends on the macro definition

SYMBOL  {DIR}"("|")"

which in turn refers to the macro DIR:

DIR [^#]

In other words, after macro processing the result is roughly:

[^#]"("|")" { printf("[%s] is a special symbol \n",yytext); }

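That rule applies to either of two possibilities:

Any character other than # followed by a (
A )

The pattern certainly matches the two characters space and ( as well as the single character ). You presumably also have a rule which discards whitespace, but because of the longest-match rule it does not apply when the space is followed by (. So that does, in fact, explain why the left parenthesis is printed with a leading space.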
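
The grouping can be checked outside flex. The sketch below is only an analogy: it uses Python's re module, where | also binds more loosely than concatenation, and the names as_written and grouped are made up for the illustration.

```python
import re

# Python analogue of the expanded flex pattern [^#]"("|")".
# '|' has the lowest precedence, so this reads: ([^#] then "(")  OR  ")".
as_written = re.compile(r'[^#]\(|\)')

# What the pattern would have to look like for [^#] to prefix both alternatives.
grouped = re.compile(r'[^#](\(|\))')

for text in [" (", ">(", ")", " )"]:
    print(repr(text),
          "as_written:", bool(as_written.fullmatch(text)),
          "grouped:", bool(grouped.fullmatch(text)))
```

Note that the grouped form would not repair the lexer either: it would still swallow whatever character precedes the parenthesis.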

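It also explains what happens in the lexical analysis of #if B>(0+1). First, #if is recognized. Then the rule [^#][A-Z]{1,8} matches, because [^#] matches a space. The next character is >, which does not match [^#]">"|">="|"<"|"<="|"=="|"!=" because a lone > is only matched when it follows some character other than #, and here the > is followed by ( rather than by > or =. On the other hand, > is not #, so at that position [^#]"("|")" does match, consuming >( as a single token. (Compare what happens with the input #if B>=(0+1).)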
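
Tracing this by hand is error-prone, so here is a rough Python model of the longest-match behaviour described above. It is a sketch under assumptions, not flex itself: the RULES table is a hand transcription of the question's definitions with each alternative listed separately, the CODE_TO_ANALY and SYMBOLYC_CTE rules are left out because the sample line does not need them, and tokenize is a hypothetical helper name.

```python
import re

# Rules in the order they appear in the question's lexer. Each alternative of
# a rule is listed separately so that the longest one can be picked.
RULES = [
    ("whitespace (discarded)", [r'[ \t]+']),
    ("relational operator", [r'[^#]>', r'>=', r'<', r'<=', r'==', r'!=']),
    ("arith operator", [r'[^#]\+', r'-', r'\*', r'/', r'%']),
    ("reserved word", [r'#define', r'#elif', r'#else', r'#endif',
                       r'#ifdef', r'#ifndef', r'#if', r'#undef']),
    ("integer constant", [r'[^#][0-9]+']),
    ("special symbol", [r'[^#]\(', r'\)']),
    ("identifier", [r'[^#][A-Z]{1,8}']),
    ("other (discarded)", [r'.']),
]

def tokenize(text):
    """Return (lexeme, rule name) pairs: longest match wins, earlier rule wins ties."""
    pos, tokens = 0, []
    while pos < len(text):
        best_len, best_name = 0, None
        for name, alternatives in RULES:
            for alt in alternatives:
                m = re.compile(alt).match(text, pos)
                # Strictly longer only, so on a tie the earlier rule is kept.
                if m and m.end() - pos > best_len:
                    best_len, best_name = m.end() - pos, name
        if best_name is None:
            # Nothing matched (e.g. a newline); flex's default rule would echo it.
            pos += 1
            continue
        tokens.append((text[pos:pos + best_len], best_name))
        pos += best_len
    return tokens

for lexeme, name in tokenize("#if B>(0+1)"):
    print(f"[{lexeme}] is {name}")
```

In this model the line is split into #if, " B", ">(", "0+", "1" and ")": the 1 falls through to the catch-all rule, which is why it never shows up in the console output. Passing "#if B>=(0+1)" instead shows a different set of mis-tokenizations.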

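That explains what is happening. But do these rules make sense?

I suspect you thought that the {DIR} expansion would cause the rest of the rule to apply only to lines which do not start with #. Nothing in (f)lex regular expression syntax suggests that interpretation, and I don't know of any regular expression syntax which works that way.

(F)lex does have a mechanism for using different rules in different lexical contexts (start conditions), which is probably what you want here. But that mechanism can only be invoked in rules, not in macro definitions.

It is worth reading the flex manual's section on start conditions for a full description; here is a partial solution based on it: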

 /* The various contexts for parsing preprocess directives. A full
  * solution would have more of these.
  */
%x CPP CPP_IF CPP_IFDEF CPP_REST
%%
  /* Anything which is not a preprocessor command */
[[:blank:]]*[^#\n[:blank:]].*      { printf("%s is code to analyse.\n", yytext); }
  /* cpp directives */
[[:blank:]]*#[[:blank:]]*          { BEGIN(CPP); }
  /* Anything else is a completely blank line. Ignore it and the trailing newline. */
.*\n                     { /* Ignore */ }
  /* The first thing in a preprocessor line is normally the command.
   * In a full solution, there would be different contexts for each
   * command type; this is just a partial solution.
   */
<CPP>{
    (el)?if              { printf("#%s directive\n", yytext); BEGIN(CPP_IF); }
    ifn?def              { printf("#%s directive\n", yytext); BEGIN(CPP_IFDEF); }
    else|endif           { printf("#%s directive\n", yytext); BEGIN(CPP_REST); }
    /* Other directives need to be added. */
    /* Fallbacks */
    [[:alpha:]][[:alnum:]]* { printf("#%s directive\n", yytext); BEGIN(CPP_REST); }
    .                    { puts("Unknown # directive"); BEGIN(CPP_REST); }
    \n                   { BEGIN(INITIAL); }
}
  /* Context to just skip everything to the end of the pp directive,
   * including backslash-newline continuations */
<CPP_REST>(.|\\\n)*      { BEGIN(INITIAL); }
  /* Constant expression context, for #if and #elif */
<CPP_IF>{
    [[:digit:]]+         { printf("[%s] is an integer constant\n", yytext); }
    [[:alpha:]_][[:alnum:]_]* { printf("[%s] is an identifier\n", yytext); }
    [[:blank:]]*         ;
    [-+*/%!~|&]|"||"|"&&" { printf("[%s] is an arithmetic operator\n", yytext); }
    [=<>!]=?             { printf("[%s] is a relational operator\n", yytext); }
    [()]                 { printf("[%s] is a parenthesis\n", yytext); }
    .                    { printf("[%s] is unrecognized\n", yytext); }
    \n                   { BEGIN(INITIAL); }
}
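
The idea behind the contexts above, that a token is classified by rules chosen according to where the scanner currently is, can also be modelled in a few lines of Python. This is a simplified sketch and not a translation of the flex file: it works line by line, the IF_TOKENS patterns and the lex function are assumptions made for the example, and like the partial solution it only tokenizes the expression after #if and #elif while skipping the rest of every other directive.

```python
import re

# Patterns tried only inside an #if / #elif expression (the CPP_IF idea).
# They are tried in order and the first match wins, so two-character
# operators are listed before their one-character prefixes.
IF_TOKENS = [
    ("integer constant", r'[0-9]+'),
    ("identifier", r'[A-Za-z_][A-Za-z0-9_]*'),
    ("relational operator", r'[=<>!]=|[<>]'),
    ("arithmetic operator", r'\|\||&&|[-+*/%!~|&]'),
    ("parenthesis", r'[()]'),
]

def lex(source):
    """Return (kind, text) pairs, choosing rules by the kind of line being read."""
    tokens = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue                                   # blank line: ignored
        if not stripped.startswith("#"):
            tokens.append(("code to analyze", stripped))
            continue
        # Directive context: read the directive name after '#'.
        m = re.match(r'#[ \t]*([A-Za-z]+)', stripped)
        if m is None:
            tokens.append(("unknown directive", stripped))
            continue
        name, rest = m.group(1), stripped[m.end():]
        tokens.append(("directive", "#" + name))
        if name not in ("if", "elif"):
            continue                                   # skip the rest of the line
        # Expression context: only here are operators and constants tokenized.
        pos = 0
        while pos < len(rest):
            if rest[pos] in " \t":
                pos += 1
                continue
            for kind, pattern in IF_TOKENS:
                t = re.compile(pattern).match(rest, pos)
                if t:
                    tokens.append((kind, t.group()))
                    pos = t.end()
                    break
            else:
                tokens.append(("unrecognized", rest[pos]))
                pos += 1
    return tokens

for kind, text in lex("#if B>(0+1)\nmain(){\n}"):
    print(f"[{text}] is {kind}")
```

Because no pattern has to consume a preceding character to decide whether it applies, B>(0+1) is split into B, >, (, 0, +, 1 and ) whether or not the input contains spaces.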


lex

preprocessor-directive

flex-lexer