Antlr4 for Python: Parse data into parts

I have a very simple Antlr4 grammar:

grammar settings;

query
    : COLUMN OPERATOR (SETTING|SCALAR)
    ;

COLUMN
    : [a-z_]+
    ;

OPERATOR
    : ('='|'>'|'<')
    ;

SETTING
    : 'setting(' [a-z_]+ ')'
    ;

SCALAR
    : [a-z_]+
    ;

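Given input strings like total_sales>setting(min_total_sales) (each one represents a database column name, an operator, and a value), I want to determine which part is the column name, which is the operator, and which is the value. I wrote some Python code for this: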

import re

from antlr4 import InputStream, CommonTokenStream

from settingsLexer import settingsLexer
from settingsParser import settingsParser

settings = {
    'min_total_sales': 1000
}

conditions = 'total_sales>setting(min_total_sales)'

lexer = settingsLexer(InputStream(conditions))
stream = CommonTokenStream(lexer)
parser = settingsParser(stream)
tree = parser.query()

# raw string, so that \( and \) reach the regex engine unchanged
regex = re.compile(r'^setting\((?P<setting_name>[a-z_]+)\)$')

column = None
operator = None
value = None

for child in tree.getChildren():
    text = child.getText()

    # how to match what is child: column or operator or value???

    # this part determines the value
    if match := regex.match(text):
        setting_name = match.group('setting_name')
        print(f'We should get value from setting named `{setting_name}`')
        # look up the setting that was actually named in the input
        min_total_sales = settings[setting_name]
    else:
        print(f'We got a simple scalar value: {text}')
        min_total_sales = int(text)

How can I tell what each child is: a column name, an operator, or a value?

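Why involve regular expressions at all? After the input is parsed, the tree has methods corresponding to the rules it matched. So the object returned by parser.query(), which corresponds to the parser rule: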

query
    : COLUMN OPERATOR (SETTING|SCALAR)
    ;

will have 4 methods: COLUMN(), OPERATOR(), SETTING(), and SCALAR().

Use them to extract the data you want:

tree = parser.query()

column = tree.COLUMN()
operator = tree.OPERATOR()
setting = tree.SETTING()

print(f"column={column}, operator={operator}, setting={setting}")
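As a rough sketch of how those accessor methods could replace the regex loop from the question, the helper below pulls all three parts out of the parsed context. The function name `extract_parts` is hypothetical, and it assumes the question's original grammar, where the SETTING token text is the whole `setting(...)` string. It relies only on the accessors returning either a node with getText() or None when that token is absent.

```python
def extract_parts(query_ctx, settings):
    """Return (column, operator, value) from a parsed `query` context.

    Assumption: `query_ctx` exposes COLUMN(), OPERATOR(), SETTING() and
    SCALAR(); each returns a node with getText(), or None if not matched.
    """
    column = query_ctx.COLUMN().getText()
    operator = query_ctx.OPERATOR().getText()

    setting_node = query_ctx.SETTING()
    if setting_node is not None:
        # With the original grammar the token text is "setting(<name>)",
        # so strip the fixed prefix and the closing parenthesis.
        text = setting_node.getText()
        setting_name = text[len("setting("):-1]
        value = settings[setting_name]
    else:
        # Plain value: return its text and let the caller convert it.
        value = query_ctx.SCALAR().getText()

    return column, operator, value
```

In the question's script this would be called as `extract_parts(tree, settings)`. Note that COLUMN and SCALAR use the same pattern in that grammar, so the lexer emits COLUMN (the rule listed first) for every bare identifier; the SCALAR branch is included only for completeness.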

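Also, I wouldn't glue setting(min_total_sales) into 1 big token; let the parser put the pieces together instead. Otherwise input like total_sales>setting ( min_total_sales ) will fail to match because of the spaces.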

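grammar settings;

query
    : column OPERATOR value EOF
    ;

column
    : NAME
    ;

value
    : setting
    | NAME
    ;

setting
    : SETTING '(' NAME ')'
    ;

OPERATOR
    : ('='|'>'|'<')
    ;

// Must come before NAME: when two lexer rules match the same text,
// the rule defined first wins.
SETTING
    : 'setting'
    ;

// A single token for every identifier. Two lexer rules with the same
// pattern cannot be told apart, so the parser decides the role instead.
NAME
    : [a-z_]+
    ;

SPACES
    : [ \t\r\n] -> skip
    ;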
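When a keyword such as setting gets its own token, it helps to recall how ANTLR's lexer chooses between overlapping rules: the longest match wins, and on a tie the rule defined first wins. The sketch below imitates that policy in plain Python with `re`. It is a toy model, not ANTLR, and the `tokenize` function and the rule tables `GLUED` and `SPLIT` (including the names LPAREN, RPAREN and NAME) are made up for illustration.

```python
import re

def tokenize(text, rules, skip=r"[ \t\r\n]"):
    """Toy lexer: longest match wins; ties go to the rule listed first."""
    skip_re = re.compile(skip)
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        ws = skip_re.match(text, pos)
        if ws:
            pos = ws.end()  # skipped characters produce no token
            continue
        best = None
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Strict ">" keeps the earlier rule when lengths are equal.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}: {text[pos]!r}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

# One big token for the whole "setting(...)" text.
GLUED = [
    ("COLUMN", r"[a-z_]+"),
    ("OPERATOR", r"[=<>]"),
    ("SETTING", r"setting\([a-z_]+\)"),
]

# Keyword, parentheses and identifier as separate tokens.
SPLIT = [
    ("OPERATOR", r"[=<>]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SETTING", r"setting"),  # keyword listed before the identifier rule
    ("NAME", r"[a-z_]+"),
]

print(tokenize("total_sales>setting ( min_total_sales )", SPLIT))
```

With `GLUED`, the same spaced-out input raises ValueError in this model, because `setting` followed by a space can only be read as an identifier and nothing matches the `(` that follows. Swapping the last two entries of `SPLIT` turns the word `setting` into a NAME token, which is why the order of overlapping rules matters.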