ANTLR4

Question

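I am using ANTLR4 to generate a lexer for a JavaScript preprocessor (basically, it tokenizes a JavaScript file and extracts every string literal).

I started from a grammar originally written for ANTLR3 and ported the relevant parts (only the lexer rules) to v4.

I have just one issue left: I don't know how to handle corner cases for regex literals, like this one: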

log(Math.round(v * 100) / 100 + ' msec/sample');

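/ 100 + ' msec/ is interpreted as a regex literal, because the lexer rule is always active.

What I would like is to incorporate this logic (C# code; I need JavaScript, but I simply don't know how to adapt it):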
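
The misreading can be reproduced outside ANTLR with a plain regular expression. This is only a sketch: the pattern below is a simplified stand-in for the RegularExpressionLiteral rule (flags omitted) and is an assumption made for the demo, not the grammar's actual rule.

```javascript
// Simplified, hypothetical stand-in for the RegularExpressionLiteral rule:
// a slash, one or more body characters (or escapes), and a closing slash.
const naiveRegexLiteral = /\/(?![*\/])(?:[^\/\\\n]|\\.)+\//;

const line = "log(Math.round(v * 100) / 100 + ' msec/sample');";

// With no knowledge of the previous token, the division operator is taken
// as the start of a regex literal that runs up to the slash inside the string.
const match = line.match(naiveRegexLiteral);

console.log(match[0]); // prints: / 100 + ' msec/
```

Nothing in the pattern itself can tell division from a regex literal; the decision needs the token that came before the slash.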

    /// <summary>
    /// Indicates whether regular expression (yields true) or division expression recognition (false) in the lexer is enabled.
    /// These are mutual exclusive and the decision which is active in the lexer is based on the previous on channel token.
    /// When the previous token can be identified as a possible left operand for a division this results in false, otherwise true.
    /// </summary>
    private bool AreRegularExpressionsEnabled
    {
        get
        {
            if (Last == null)
            {
                return true;
            }

            switch (Last.Type)
            {
                // identifier
                case Identifier:
                // literals
                case NULL:
                case TRUE:
                case FALSE:
                case THIS:
                case OctalIntegerLiteral:
                case DecimalLiteral:
                case HexIntegerLiteral:
                case StringLiteral:
                // member access ending 
                case RBRACK:
                // function call or nested expression ending
                case RPAREN:
                    return false;

                // otherwise OK
                default:
                    return true;
            }
        }
    }

This rule is present in the old grammar as an inline predicate, like this:

RegularExpressionLiteral
    : { AreRegularExpressionsEnabled }?=> DIV RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart*
    ;

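But I don't know how to use this technique in ANTLR4.

The ANTLR4 book has some suggestions for solving this kind of problem at the parser level (chapter 12.2, context-sensitive lexical problems), but I don't want to use a parser. I just want to extract all the tokens, leave everything untouched except for the string literals, and keep parsing out of the way.

Any suggestion would be really appreciated, thanks!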

Answer 1

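I'm posting the final solution here. I developed it by adapting the existing one to the new syntax of ANTLR4 and by addressing the differences in JavaScript syntax.

I'm posting just the relevant parts, to give others a clue about a working strategy.

The rule was edited as follows: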

RegularExpressionLiteral
    : DIV {this.isRegExEnabled()}? RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart*
    ;

The function isRegExEnabled is defined in the @members section at the top of the lexer grammar, as follows:

@members {
// Wrap the inherited nextToken and remember the last token that is not on
// the hidden channel, i.e. not whitespace or a comment.
EcmaScriptLexer.prototype.nextToken = function() {
  var result = antlr4.Lexer.prototype.nextToken.call(this);
  if (result.channel !== antlr4.Lexer.HIDDEN) {
    this._Last = result;
  }

  return result;
};

// A regex literal is allowed unless the last significant token can end the
// left operand of a division.
EcmaScriptLexer.prototype.isRegExEnabled = function() {
  var la = this._Last ? this._Last.type : null;
  return la !== EcmaScriptLexer.Identifier &&
    la !== EcmaScriptLexer.NULL &&
    la !== EcmaScriptLexer.TRUE &&
    la !== EcmaScriptLexer.FALSE &&
    la !== EcmaScriptLexer.THIS &&
    la !== EcmaScriptLexer.OctalIntegerLiteral &&
    la !== EcmaScriptLexer.DecimalLiteral &&
    la !== EcmaScriptLexer.HexIntegerLiteral &&
    la !== EcmaScriptLexer.StringLiteral &&
    la !== EcmaScriptLexer.RBRACK &&
    la !== EcmaScriptLexer.RPAREN;
};
}

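As you can see, two functions are defined. The first overrides the lexer's nextToken method: it wraps the existing nextToken and saves the last token that is not a comment or whitespace for reference. The semantic predicate then calls isRegExEnabled to check whether that last significant token is compatible with the presence of a regex literal. If it is not, the predicate returns false, so the regex rule does not apply and the slash is left to be tokenized as a division operator.

Thanks to Lucas Trzesniewski for the comment: it pointed me in the right direction, and to Patrick Hulsmeijer for his original work on v3.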
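
To try the strategy without generating a lexer, here is a minimal sketch in plain JavaScript. Everything in it is hypothetical and exists only for illustration: MiniLexer is a toy regex-driven tokenizer, not the ANTLR4 runtime, and the token type names, the hidden flag and the significantTokens helper are assumptions. Only the two ideas come from the solution above: wrap nextToken to remember the last non-hidden token, and guard the regex rule with isRegExEnabled.

```javascript
// Hypothetical token type names for this sketch only; a generated ANTLR4
// lexer exposes its own numeric constants (e.g. EcmaScriptLexer.Identifier).
const T = Object.freeze({
  WS: 'WS', Identifier: 'Identifier', DecimalLiteral: 'DecimalLiteral',
  StringLiteral: 'StringLiteral', RegularExpressionLiteral: 'RegularExpressionLiteral',
  RPAREN: 'RPAREN', RBRACK: 'RBRACK', DIV: 'DIV', Punctuator: 'Punctuator', EOF: 'EOF',
});

// Token types that can end the left operand of a division.
const DIVISION_OPERAND_ENDINGS = new Set([
  T.Identifier, T.DecimalLiteral, T.StringLiteral, T.RBRACK, T.RPAREN,
]);

// Toy lexer: tries each rule in order at the current position.
class MiniLexer {
  constructor(input) {
    this.input = input;
    this.pos = 0;
    // [type, pattern, optional guard]; the guard plays the role of the
    // semantic predicate on the regex rule.
    this.rules = [
      [T.WS, /^\s+/],
      [T.Identifier, /^[A-Za-z_$][\w$]*/],
      [T.DecimalLiteral, /^\d+(?:\.\d+)?/],
      [T.StringLiteral, /^(?:'(?:[^'\\\n]|\\.)*'|"(?:[^"\\\n]|\\.)*")/],
      [T.RegularExpressionLiteral, /^\/(?![*\/])(?:[^\/\\\n]|\\.)+\/[A-Za-z]*/,
        () => this.isRegExEnabled()],
      [T.RPAREN, /^\)/],
      [T.RBRACK, /^\]/],
      [T.DIV, /^\//],
      [T.Punctuator, /^[^\s\w$]/],
    ];
  }

  // Without context the regex rule is always active (the problem case).
  isRegExEnabled() { return true; }

  nextToken() {
    if (this.pos >= this.input.length) return { type: T.EOF, text: '', hidden: false };
    const rest = this.input.slice(this.pos);
    for (const [type, pattern, guard] of this.rules) {
      const m = pattern.exec(rest);
      if (m && (!guard || guard())) {
        this.pos += m[0].length;
        return { type, text: m[0], hidden: type === T.WS };
      }
    }
    throw new Error(`Unexpected character at offset ${this.pos}`);
  }
}

// Same two steps as in the @members section: wrap nextToken to remember the
// last non-hidden token, and consult it before allowing a regex literal.
class RegexAwareLexer extends MiniLexer {
  nextToken() {
    const token = super.nextToken();
    if (!token.hidden) this.last = token;
    return token;
  }

  isRegExEnabled() {
    return !this.last || !DIVISION_OPERAND_ENDINGS.has(this.last.type);
  }
}

// Pump the lexer until EOF and keep only the non-hidden tokens.
function significantTokens(lexer) {
  const tokens = [];
  for (let t = lexer.nextToken(); t.type !== T.EOF; t = lexer.nextToken()) {
    if (!t.hidden) tokens.push(t);
  }
  return tokens;
}

const source = "log(Math.round(v * 100) / 100 + ' msec/sample');";
const strings = significantTokens(new RegexAwareLexer(source))
  .filter((t) => t.type === T.StringLiteral)
  .map((t) => t.text);
console.log(strings); // one string literal: ' msec/sample'
```

Running the same input through the base MiniLexer, where isRegExEnabled always returns true, reproduces the problem from the question: the division is swallowed into a regex token and no string literal is found. A real regex literal such as the one in x = /ab+c/g.test(s) is still recognized by RegexAwareLexer, because the preceding significant token is an operator rather than a possible left operand.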

ANTLR4 - parsing regex literals in JavaScript grammar

parsing

lexer