ANTLR: lexing bash files, especially heredoc

I am using ANTLR to lex bash files (for syntax coloring). Is it possible to lex rules with a dynamic end, such as a heredoc:

cat <<ENDTEXT
hello world, 
this text may contain 
any letters, even ' and "
ENDTEXT

cat <<FOO
here a different end-word
is used
FOO

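This is only possible with a semantic predicate, because the end word is not known until the start of the heredoc has been read.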

Here is a quick demo:

lexer grammar BashLexer;

@members {
  private boolean heredocEndAhead(String partialHeredoc) {
    if (this.getCharPositionInLine() != 0) {
      // If the lexer is not at the start of a line, no end-delimiter can be possible
      return false;
    }

    // Get the delimiter
    String firstLine = partialHeredoc.split("\r?\n|\r")[0];
    String delimiter = firstLine.replaceAll("^<<-?\\s*", "");

    // LA(n) is 1-based, so compare all delimiter.length() characters (note the <=)
    for (int n = 1; n <= delimiter.length(); n++) {
      if (this._input.LA(n) != delimiter.charAt(n - 1)) {
        return false;
      }
    }

    // If we get to this point, we know there is an end delimiter ahead in the char stream, make
    // sure it is followed by a white space (or the EOF). If we don't do this, then "FOOS" would also
    // be considered the end for the delimiter "FOO"
    int charAfterDelimiter = this._input.LA(delimiter.length() + 1);

    return charAfterDelimiter == EOF ||  Character.isWhitespace(charAfterDelimiter);
  }
}

HEREDOC
 : '<<' '-'? [ \t]* [a-zA-Z_] [a-zA-Z_0-9]* NL ( {!heredocEndAhead(getText())}? . )* [a-zA-Z_] [a-zA-Z_0-9]*
 ;

ANY
 : .
 ;

fragment NL
 : '\r'? '\n'
 | '\r'
 ;

The grammar above will tokenize the input:

cat <<ENDTEXT
hello world, 
ENDTEXTS ENDTEXT
this text may contain 
any letters, even ' and "
ENDTEXT

like this:

ANY      `c`
ANY      `a`
ANY      `t`
ANY      ` `
HEREDOC  `<<ENDTEXT\nhello world, \nENDTEXTS ENDTEXT\nthis text may contain \nany letters, even ' and "\nENDTEXT`
EOF      `<EOF>`
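
The lookahead check inside the predicate can also be tried out without ANTLR. The following is only a minimal sketch in plain Java that mirrors the same three conditions (start of a line, all delimiter characters match, followed by whitespace or the end of input). The class HeredocEndCheck and its methods endAhead and endOffsets are hypothetical names made up for this illustration; they are not part of ANTLR or of the grammar above, and a plain String index stands in for the lexer's char stream.

```java
import java.util.ArrayList;
import java.util.List;

public class HeredocEndCheck {

    // Mirrors the predicate: does the end delimiter start at offset pos?
    public static boolean endAhead(String input, int pos, String delimiter) {
        // The delimiter must begin at the start of a line (offset 0 counts as one here).
        if (pos > 0 && input.charAt(pos - 1) != '\n' && input.charAt(pos - 1) != '\r') {
            return false;
        }
        // Every character of the delimiter must match, including the last one.
        if (!input.startsWith(delimiter, pos)) {
            return false;
        }
        // The delimiter must be followed by whitespace or the end of the input,
        // otherwise "FOOS" would also end a heredoc that was opened with "FOO".
        int after = pos + delimiter.length();
        return after == input.length() || Character.isWhitespace(input.charAt(after));
    }

    // Collects every offset in the heredoc body at which the end delimiter is detected.
    public static List<Integer> endOffsets(String body, String delimiter) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < body.length(); i++) {
            if (endAhead(body, i, delimiter)) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The body of the heredoc from the example input, without the opening "<<ENDTEXT" line.
        String body = "hello world, \nENDTEXTS ENDTEXT\nthis text may contain \n"
                + "any letters, even ' and \"\nENDTEXT";
        System.out.println(endOffsets(body, "ENDTEXT"));
        System.out.println(endAhead("FOX \n", 0, "FOO"));
    }
}
```

For the example body, only the final ENDTEXT line should be reported: ENDTEXTS fails the "followed by whitespace" check, and the second ENDTEXT on that line is not at the start of a line. The "FOX" call shows why every character of the delimiter has to be compared.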