Processing flat file with antlr4

Question

I need to process a flat text file and I am trying to generate a parser for it with antlr4. The file format is as follows:

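The file can contain multiple records
There is one record per line
Each record has multiple fields
The number of fields depends on the record type
The total length of each record is not fixed and depends on the number of fields
The record type is defined by the first 3 alphanumeric elements
Each field has a specific start position (column in the record) and a number of elements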

Example file

ACF0000000101IAR
FAT0000000203IARGL9344KDKK
FAT0000000301IARGM

Example grammar

grammar Cat;

file : record+ ;

record: (file_header | cycle_header);

file_header : 'ACF' FIELD1 FIELD2 FIELD3;
cycle_header : 'FAT' FIELD1 FIELD2;

FIELD1 : DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT DIGIT;
FIELD2 : DIGIT DIGIT;
FIELD3 : ALPHANUM ALPHANUM ALPHANUM;

fragment DIGIT: [0-9];
fragment ALPHANUM: [A-Za-z] | DIGIT | ' ';
fragment NEWLINE: '\n';

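The problem I have with this grammar is that when I inspect the tree, FIELD2 in the file_header rule is not matched; FIELD3 is matched instead. Keep in mind that the grammar for cycle_header is incomplete.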

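My expectation was that, since FIELD2 comes before FIELD3 in the file_header rule, FIELD2 would match any two digits and the remaining characters would be matched by FIELD3, but that is not what the parse tree shows.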

So my questions are:

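Is Antlr4 suitable for parsing such a file structure, or would some kind of parsing with regex be more suitable?
Why is FIELD3 matched before FIELD2? Is there something I have misunderstood?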

Answer 1

Is Antlr4 suitable for parsing such a file structure, or would some kind of parsing with regex be more suitable?

No, I agree with rici. His comment really should have been an answer:

Antlr4 is probably not the optimal choice for this problem. The Antlr lexer is not really contextual, resulting in the problem you see; the lexer matches whichever lexical pattern has the longest match at the given input. You could use a scannerless approach, without lexical rules, but honestly you're probably better off just dividing the input line up with substring()
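A minimal sketch of the substring approach rici describes, written in Python. The field widths are taken from the grammar in the question (8 digits, 2 digits, 3 alphanumerics). The `LAYOUTS` table, the function names, and the `rest` key are hypothetical names introduced for illustration. Because the cycle_header layout is incomplete in the question, everything after its two known fields is kept unparsed rather than guessed at.

```python
# Hypothetical layout table: record type -> list of (field name, start column, width).
# Columns are 0-based; the widths come from the grammar in the question.
LAYOUTS = {
    "ACF": [("FIELD1", 3, 8), ("FIELD2", 11, 2), ("FIELD3", 13, 3)],
    # cycle_header is incomplete in the question, so only the two known fields are listed.
    "FAT": [("FIELD1", 3, 8), ("FIELD2", 11, 2)],
}


def parse_record(line):
    """Split one line into named fields using fixed column positions."""
    line = line.rstrip("\r\n")
    record_type = line[:3]
    layout = LAYOUTS.get(record_type)
    if layout is None:
        raise ValueError(f"unknown record type: {record_type!r}")
    fields = {"type": record_type}
    end = 3
    for name, start, width in layout:
        fields[name] = line[start:start + width]
        end = start + width
    # Anything past the last known field is kept verbatim.
    fields["rest"] = line[end:]
    return fields


def parse_file(text):
    """Parse every non-empty line of the input."""
    return [parse_record(line) for line in text.splitlines() if line.strip()]
```

With the example file from the question, the first line yields FIELD1 = 00000001, FIELD2 = 01 and FIELD3 = IAR. Slicing by column avoids the ambiguity entirely, because the position decides which field a character belongs to, not the character's class.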

Why is FIELD3 matched before FIELD2? Is there something I have misunderstood?

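To expand on rici's comment: ANTLR's lexer is not "driven by the parser" (the lexer does not produce tokens based on what the parser is trying to match). The lexer always creates tokens according to two simple rules: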

1. Try to match as many characters as possible
2. If two or more lexer rules match the same characters, let the rule defined first "win"

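Because of rule 1, for input like 123 a FIELD3 token is created in preference to a FIELD2 token: FIELD3 matches all three characters, while FIELD2 matches only the first two.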
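The two lexer rules can be imitated with a small Python sketch. This is a simplified simulation and not ANTLR's actual implementation: the `RULES` list and the `tokenize` function are hypothetical, the patterns are regex transcriptions of the lexer rules in the question, and the sketch stops at the first position where nothing matches instead of reporting a token recognition error. It assumes the literals 'ACF' and 'FAT' are defined ahead of the named lexer rules, which is how ANTLR treats literals used in parser rules as far as I know.

```python
import re

# Rules in definition order; the literals from the parser rules come first (assumption).
RULES = [
    ("'ACF'", re.compile(r"ACF")),
    ("'FAT'", re.compile(r"FAT")),
    ("FIELD1", re.compile(r"[0-9]{8}")),
    ("FIELD2", re.compile(r"[0-9]{2}")),
    ("FIELD3", re.compile(r"[A-Za-z0-9 ]{3}")),
]


def tokenize(text):
    """Return (tokens, unmatched_rest) using longest match, then first-defined rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # (rule name, end position) of the best match so far
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match replaces the current best (rule 1);
            # on a tie the earlier rule is kept (rule 2).
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            break  # no rule matches here; a real lexer would report an error
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens, text[pos:]
```

Calling `tokenize("ACF0000000101IAR")` in this sketch produces 'ACF', then FIELD1 for 00000001, then FIELD3 for 01I, and leaves AR unmatched. The parser never sees a FIELD2 token at that position, which is consistent with the behaviour described in the question.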
