How to tokenize blocks (comments, strings, ...) as well as inter-blocks (any char outside blocks)?

Question

I need to tokenize everything that is "outside" any comment, up to the end of the line. For instance:

take me */ and me /* but not me! */ I'm in! // I'm not...

should be tokenized as (STR is an "outside" string, BC is a block comment, and LC is a single-line comment):

{
    STR: "take me */ and me ", // note the "*/" in the string!
    BC : " but not me! ",
    STR: " I'm in! ",
    LC : " I'm not..."
}

And:

/* starting with don't take me */ ...take me...

should be tokenized as:

{
    BC : " starting with don't take me ",
    STR: " ...take me..."
}

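The problem is that STR can be anything except a comment, and since the comment openers are not single-character tokens, I can't use a negated set for the STR rule.

I thought maybe I could do something like this: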

STR : { IsNextSequenceTerminatesThe_STR_rule(); }?;

But I don't know how to look ahead at upcoming characters from within a lexer action.

Is it even possible to do this with the ANTLR4 lexer, and if so, how?

Answer 1

Try something like this:

grammar T;

@lexer::members {

  // Returns true iff either "//" or "/*"  is ahead in the char stream.
  boolean startCommentAhead() {
    return _input.LA(1) == '/' && (_input.LA(2) == '/' || _input.LA(2) == '*');
  }
}

// other rules

STR
 : ( {!startCommentAhead()}? . )+
 ;
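
To see why two characters of lookahead are enough here, the same test can be written as a plain Java scanner with no ANTLR dependency. This is only a sketch: the class LookaheadSplitter and its split method are hypothetical names, the treatment of an unterminated block comment is an assumption, and it works on a single line held in a String rather than on a character stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ANTLR: a hand-written scanner that applies
// the same two-character lookahead test as startCommentAhead() in the grammar.
class LookaheadSplitter {

    // Mirrors startCommentAhead(): true iff "//" or "/*" begins at index i.
    static boolean startCommentAhead(String s, int i) {
        return i + 1 < s.length()
                && s.charAt(i) == '/'
                && (s.charAt(i + 1) == '/' || s.charAt(i + 1) == '*');
    }

    // Splits one line into entries prefixed with "STR:", "BC:" or "LC:".
    static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (!startCommentAhead(s, i)) {
                // STR: keep consuming while no comment opener is ahead.
                int start = i;
                while (i < s.length() && !startCommentAhead(s, i)) {
                    i++;
                }
                out.add("STR:" + s.substring(start, i));
            } else if (s.charAt(i + 1) == '/') {
                // LC: a single-line comment runs to the end of the line.
                out.add("LC:" + s.substring(i + 2));
                i = s.length();
            } else {
                // BC: runs to the next "*/" (assumption: to the end of the
                // input if the comment is never closed).
                int close = s.indexOf("*/", i + 2);
                int end = close < 0 ? s.length() : close;
                out.add("BC:" + s.substring(i + 2, end));
                i = close < 0 ? s.length() : close + 2;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "take me */ and me /* but not me! */ I'm in! // I'm not...";
        for (String piece : split(line)) {
            System.out.println(piece);
        }
    }
}
```

A stray "*/" outside a comment stays inside STR because the test only looks for openers, which is the behaviour the question asks for in its first example.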

Answer 2

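Yes, it is possible to perform the tokenization you are attempting.

Based on what has been described above, you want nested comments. These can be achieved in the lexer alone, without actions, predicates, or any other code. To get nested comments, it is easier if you do not use ANTLR's greedy/non-greedy options. You will need to spell this out in the lexer grammar. Below are the three lexer rules you will need, along with a definition of STR.

I added a parser rule for testing. I have not tested this, but it should do everything you mentioned. Also, it is not limited to "end of line"; you can make that modification if you need it.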

/*
    All 3 COMMENTS are Mutually Exclusive
 */
DOC_COMMENT
        : '/**'
          ( [*]* ~[*/]         // Cannot START/END Comment
            ( DOC_COMMENT
            | BLK_COMMENT
            | INL_COMMENT
            | .
            )*?
          )?
          '*'+ '/' -> channel( DOC_COMMENT )
        ;
BLK_COMMENT
        : '/*'
          (
            ( /* Must never match an '*' in position 3 here, otherwise
                 there is a conflict with the definition of DOC_COMMENT
               */
              [/]? ~[*/]       // No START/END Comment
            | DOC_COMMENT
            | BLK_COMMENT
            | INL_COMMENT
            )
            ( DOC_COMMENT
            | BLK_COMMENT
            | INL_COMMENT
            | .
            )*?
          )?
          '*/' -> channel( BLK_COMMENT )
        ;
INL_COMMENT
        : '//'
          ( ~[\n\r*/]          // No NEW_LINE
          | INL_COMMENT        // Nested Inline Comment
          )* -> channel( INL_COMMENT )
        ;
STR       // Consume everything up to the start of a COMMENT
        : ( ~'/'      // Any Char not used to START a Comment
          | '/' ~[*/] // Cannot START a Comment
          )+
        ;

start
        : DOC_COMMENT
        | BLK_COMMENT
        | INL_COMMENT
        | STR
        ;
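
As a rough illustration of what "nested" means for block comments, the following plain Java sketch finds the end of a block comment by counting openers and closers. It is not equivalent to the grammar above: the class NestedComment and the method endOfBlockComment are hypothetical names, only "/*" and "*/" are considered (no doc or inline comments), and returning -1 for an unclosed comment is an assumption.

```java
// Hypothetical helper illustrating nesting only; it is not generated by ANTLR
// and does not reproduce the DOC_COMMENT / INL_COMMENT rules.
class NestedComment {

    // Returns the index just past the "*/" that closes the block comment
    // opening at `start`, counting nested "/*" ... "*/" pairs. Returns -1 if
    // the text at `start` is not "/*" or if the comment is never closed.
    static int endOfBlockComment(String s, int start) {
        if (!s.startsWith("/*", start)) {
            return -1;
        }
        int depth = 0;
        int i = start;
        while (i + 1 < s.length()) {
            if (s.startsWith("/*", i)) {
                depth++;            // an opener, possibly nested
                i += 2;
            } else if (s.startsWith("*/", i)) {
                depth--;            // closes the innermost open comment
                i += 2;
                if (depth == 0) {
                    return i;       // the outermost comment is now closed
                }
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "/* a /* b */ c */ tail";
        int end = endOfBlockComment(text, 0);
        System.out.println("comment: " + text.substring(0, end));
        System.out.println("rest:    " + text.substring(end));
    }
}
```

The recursive references to BLK_COMMENT inside its own rule play the role of the depth counter here: each inner opener must be matched by its own closer before the outer one can end.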
