Using Antlr to parse formulas with multiple locales

I'm new to Antlr, so please forgive what is probably a very simple question.

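I'm creating a grammar that parses Excel-like formulas, and it needs to support multiple locales, which differ in the list separator (, for en-US) and the decimal separator (. for en-US). I would prefer not to choose between separate grammars depending on the locale.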

Can I achieve this by modifying or inheriting from the CommonTokenStream class, or is there some other way to do it? An example would be helpful.

I'm using the Antlr v4.5.0-alpha003 NuGet package in my VS2015 C# project.

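What you could do is add a locale (or custom separator and grouping characters) to your lexer, and put a semantic predicate in front of the lexer rules for those characters. Each predicate checks the next input character against your custom separator or grouping character, so the matching tokens are decided dynamically at lexing time.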

I don't have ANTLR and C# running here, but the Java demo should be very similar:

grammar LocaleDemo;

@lexer::header {
  import java.text.DecimalFormatSymbols;
  import java.util.Locale;
}

@lexer::members {

  private char decimalSeparator = '.';
  private char groupingSeparator = ',';

  public LocaleDemoLexer(CharStream input, Locale locale) {
    this(input);
    DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
    this.decimalSeparator = dfs.getDecimalSeparator();
    this.groupingSeparator = dfs.getGroupingSeparator();
  }
}

parse
 : .*? EOF
 ;

NUMBER
 : D D? ( DG D D D )* ( DS D+ )?
 ;

OTHER
 : .
 ;

fragment D  : [0-9];
fragment DS : {_input.LA(1) == decimalSeparator}?  . ;
fragment DG : {_input.LA(1) == groupingSeparator}? . ;
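
The lexer constructor above takes both characters from java.text.DecimalFormatSymbols. As a quick standalone check of what it would store for a given locale, something like the following could be used. The class name SeparatorProbe and its describe helper are made up for this illustration and are not part of the grammar:

```java
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class SeparatorProbe {

    // Hypothetical helper: reports the two characters that the lexer
    // constructor would read for the given locale.
    static String describe(Locale locale) {
        DecimalFormatSymbols dfs = DecimalFormatSymbols.getInstance(locale);
        return "decimal='" + dfs.getDecimalSeparator()
             + "' grouping='" + dfs.getGroupingSeparator() + "'";
    }

    public static void main(String[] args) {
        System.out.println("en: " + describe(Locale.ENGLISH));
        System.out.println("de: " + describe(Locale.GERMAN));
    }
}
```

For Locale.ENGLISH the decimal separator is a period and the grouping separator a comma; for Locale.GERMAN the two are swapped, which is what the DS and DG predicates key on.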

To test the grammar above, run this class:

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
import java.util.Locale;

public class Main {

    private static void tokenize(String input, Locale locale) {

        LocaleDemoLexer lexer = new LocaleDemoLexer(new ANTLRInputStream(input), locale);
        System.out.printf("\ninput='%s', locale=%s, tokens:\n", input, locale);

        for (Token t : lexer.getAllTokens()) {
            System.out.printf("  %-10s '%s'\n", LocaleDemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }

    public static void main(String[] args) throws Exception {

        tokenize("1.23", Locale.ENGLISH);
        tokenize("1.23", Locale.GERMAN);

        tokenize("12.345.678,90", Locale.ENGLISH);
        tokenize("12.345.678,90", Locale.GERMAN);
    }
}

This will print:

input='1.23', locale=en, tokens:
  NUMBER     '1.23'

input='1.23', locale=de, tokens:
  NUMBER     '1'
  OTHER      '.'
  NUMBER     '23'

input='12.345.678,90', locale=en, tokens:
  NUMBER     '12.345'
  OTHER      '.'
  NUMBER     '67'
  NUMBER     '8'
  OTHER      ','
  NUMBER     '90'

input='12.345.678,90', locale=de, tokens:
  NUMBER     '12.345.678,90'
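
To see why the token boundaries in the output above fall where they do, it may help to spell out the NUMBER rule by hand. The following is only a rough sketch without ANTLR: the class NumberScanSketch and its methods are hypothetical, the plain character comparisons stand in for the semantic predicates, and the simple greedy scan is assumed to be close enough to the generated lexer for inputs like these.

```java
import java.text.DecimalFormatSymbols;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NumberScanSketch {

    // True if position i holds an ASCII digit, like the fragment D : [0-9].
    private static boolean digit(String s, int i) {
        return i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9';
    }

    // Length of a NUMBER match starting at 'start', or 0 if there is none.
    // Mirrors: D D? ( DG D D D )* ( DS D+ )?
    static int matchNumber(String s, int start, char ds, char dg) {
        int i = start;
        if (!digit(s, i)) return 0;
        i++;                                   // D
        if (digit(s, i)) i++;                  // D?
        // ( DG D D D )* : comparing against dg plays the role of the predicate
        while (i < s.length() && s.charAt(i) == dg
                && digit(s, i + 1) && digit(s, i + 2) && digit(s, i + 3)) {
            i += 4;
        }
        // ( DS D+ )? : comparing against ds plays the role of the predicate
        if (i < s.length() && s.charAt(i) == ds && digit(s, i + 1)) {
            i++;
            while (digit(s, i)) i++;
        }
        return i - start;
    }

    // Splits the input into NUMBER matches and single-character OTHER tokens.
    static List<String> tokenize(String input, Locale locale) {
        DecimalFormatSymbols dfs = DecimalFormatSymbols.getInstance(locale);
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int len = matchNumber(input, pos,
                    dfs.getDecimalSeparator(), dfs.getGroupingSeparator());
            if (len == 0) len = 1;             // OTHER : any single character
            tokens.add(input.substring(pos, pos + len));
            pos += len;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("12.345.678,90", Locale.ENGLISH));
        System.out.println(tokenize("12.345.678,90", Locale.GERMAN));
    }
}
```

With the English locale the first period is taken as the decimal separator, so the number ends after 12.345 and the rest falls apart into small tokens. With the German locale the periods are grouping separators and the comma is the decimal separator, so the whole input is one number.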

Related Q&As:

  • What is a 'semantic predicate' in ANTLR?
  • What does "fragment" mean in ANTLR?

As a follow-up to Bart's answer, this is the grammar I created based on his suggestion:

grammar ExcelScript;



@lexer::header
{
using System;
using System.Globalization;
}

@lexer::members
{
    private Int32 listseparator = 44; // UTF16 value for comma
    private Int32 decimalseparator = 46; // UTF16 value for period

    /// <summary>
    /// Creates a new lexer object
    /// </summary>
    /// <param name="input">The input stream</param>
    /// <param name="locale">The locale to use in parsing numbers</param>
    /// <returns>A new lexer object</returns>
    public ExcelScriptLexer (ICharStream input, CultureInfo locale)
    : this(input)
    {
        this.listseparator = Convert.ToInt32(locale.TextInfo.ListSeparator[0]);
        this.decimalseparator = Convert.ToInt32(locale.NumberFormat.NumberDecimalSeparator[0]);

        // special case for 8 locales where the list separator is a , and the number separator is a , too
        // Excel uses semicolon for list separator, so we will too
        if (this.listseparator == 44 && this.decimalseparator == 44)
            this.listseparator = 59; // UTF16 value for semicolon
    }
}


/*
 * Parser Rules
 */

formula
    :   numberLiteral
    |   Identifier
    |   '=' expression
    ;

expression
    :   primary                                     # PrimaryExpression
    |   Identifier arguments                                # FunctionCallExpression
    |   ('+' | '-') expression                              # UnarySignExpression
    |   expression ('*' | '/' | '%') expression                     # MulDivModExpression
    |   expression ('+' | '-') expression                       # AddSubExpression
    |   expression ('<=' | '>=' | '>' | '<') expression                 # CompareExpression
    |   expression ('=' | '<>') expression                      # EqualCompareExpression
    ;

primary
    :   '(' expression ')'                              # ParenExpression
    |   literal                                     # LiteralExpression
    |   Identifier                                  # IdentifierExpression
    ;

literal
    :   numberLiteral                                   # NumberLiteralRule
    |   booleanLiteral                                  # BooleanLiteralRule
    ;

numberLiteral
    :   IntegerLiteral
    |   FloatingPointLiteral
    ;

booleanLiteral
    :   TrueKeyword
    |   FalseKeyword
    ;

arguments
    :   '(' expressionList? ')'
    ;

expressionList
    :   expression (ListSeparator expression)*
    ;

/*
 * Lexer Rules
 */

AddOperator :   '+' ;
SubOperator :   '-' ;
MulOperator :   '*' ;
DivOperator :   '/' ;
PowOperator :   '^' ;
EqOperator  :   '=' ;
NeqOperator :   '<>' ;
LeOperator  :   '<=' ;
GeOperator  :   '>=' ;
LtOperator  :   '<' ;
GtOperator  :   '>' ;

ListSeparator : {_input.La(1) == listseparator}? . ;
DecimalSeparator : {_input.La(1) == decimalseparator}? . ;

TrueKeyword :   [Tt][Rr][Uu][Ee] ;
FalseKeyword    :   [Ff][Aa][Ll][Ss][Ee] ;

Identifier
    :   Letter (Letter | Digit)*
    ;

fragment Letter
    :   [A-Z_a-z]
    ;

fragment Digit
    :   [0-9]
    ;

IntegerLiteral
    :   '0'
    |   [1-9] [0-9]*
    ;

FloatingPointLiteral
    :   [0-9]+ DecimalSeparator [0-9]* Exponent?
    |   DecimalSeparator [0-9]+ Exponent?
    |   [0-9]+ Exponent
    ;

fragment Exponent
    :   ('e' | 'E') ('+' | '-')? ('0'..'9')+
    ;

WhiteSpace
    :   [ \t]+ -> channel(HIDDEN)
    ;
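
The special case in the lexer constructor above (falling back to a semicolon when the list separator would collide with the decimal separator) can be tried out on its own. Below is a small Java sketch of that rule, to stay in line with the Java demo earlier in the thread. Java has no direct counterpart to .NET's TextInfo.ListSeparator, so this version simply assumes a comma as the default list separator; the class and method names are hypothetical.

```java
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class ListSeparatorSketch {

    // Assumption: the list separator is a comma unless the locale already
    // uses the comma as its decimal separator, in which case a semicolon
    // is used instead (the same fallback as in the C# constructor).
    static char listSeparatorFor(Locale locale) {
        char decimal = DecimalFormatSymbols.getInstance(locale).getDecimalSeparator();
        return decimal == ',' ? ';' : ',';
    }

    public static void main(String[] args) {
        System.out.println("en: " + listSeparatorFor(Locale.ENGLISH));
        System.out.println("de: " + listSeparatorFor(Locale.GERMAN));
    }
}
```

Keeping the two characters distinct matters because the ListSeparator and DecimalSeparator rules both match a single arbitrary character, and only their predicates tell them apart.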