ANTLR4 Problems parsing text file with ".*?" parsing expression
I'm having some trouble parsing this file:
TYPE "Frequ"
VERSION : 0.1
// blah blah ...
   INSIDE
      Clk; // Clocks information
      Imp; // Impulse information
   END_INSIDE

END_TYPE
The grammar file I'm using:
grammar gr;
type : 'TYPE' .*? 'END_TYPE';
I "just" want to get everything between the "TYPE" and "END_TYPE" parts.
Isn't this possible?
The error I get from the command line:
line 1:0 missing 'TYPE' at 'TYPE "Frequ"\r\nVERSION : 0.1\r\n// blah blah ...\r\n INSIDE\r\n Clk; // Clocks information\r\n Imp; // Impulse information\r\n END_INSIDE\r\n\r\nEND_TYPE'
Thanks in advance.
-斯蒂尔尼
When you use . inside a parser rule, it means "match any token". Given the grammar:
grammar T;
parse : . ;
A : 'aaa';
B : 'bbb';
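Here the . in the parse rule will only match the tokens A or B. A parser rule never sees individual characters, only the tokens the lexer has already produced.
So you need to define a lexer rule instead: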
grammar T;
parse : TYPE;
TYPE
: 'TYPE' .*? 'END_TYPE'
;
Inside a lexer rule, . matches any character (including line breaks).
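As a rough illustration of what the non-greedy .*? does at the character level, here is a small sketch that uses java.util.regex instead of ANTLR (an assumption made so it runs with the JDK alone; the class and method names are hypothetical). With DOTALL enabled the dot also matches line breaks, and the reluctant quantifier stops at the nearest END_TYPE rather than the last one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonGreedyDemo {
    // Reluctant quantifier: consume as little as possible before END_TYPE.
    // DOTALL lets '.' match line terminators as well.
    private static final Pattern TYPE_BLOCK =
            Pattern.compile("TYPE(.*?)END_TYPE", Pattern.DOTALL);

    /** Returns the text between the first TYPE and the nearest END_TYPE, or null. */
    static String between(String input) {
        Matcher m = TYPE_BLOCK.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "TYPE \"Frequ\"\nVERSION : 0.1\nEND_TYPE\nTYPE \"Other\"\nEND_TYPE";
        // Prints only the body of the first block, not everything up to the last END_TYPE.
        System.out.println(between(input));
    }
}
```

Note that the lexer rule above yields one token containing the TYPE and END_TYPE keywords too, so you would still have to strip them from the token text yourself.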
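The basic problem you're going to run into is how ANTLR tokenizes the input stream and how it resolves conflicting/ambiguous lexer rules.
If two rules match a string of characters:
- If one rule matches a longer string of characters than the other, the longest match wins.
- If both matches are the same length, the rule that is defined first wins.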
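The two rules above can be sketched in plain Java. This is only a simulation, assuming java.util.regex patterns as stand-ins for lexer rule bodies; the class name and the ID rule are hypothetical and were added just to show the tie-break. It is not how ANTLR is implemented internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MaximalMunchDemo {
    // Rules in definition order (LinkedHashMap keeps insertion order).
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("TYPE", Pattern.compile("TYPE"));
        RULES.put("END_TYPE", Pattern.compile("END_TYPE"));
        RULES.put("ID", Pattern.compile("[A-Za-z_]+"));          // hypothetical rule
        RULES.put("OTHER", Pattern.compile(".", Pattern.DOTALL)); // any single character
    }

    /** Tokenizes the input; each entry is formatted as RULE:text. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // A strictly longer match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestRule = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            // OTHER matches any character, so bestLen is at least 1 here.
            tokens.add(bestRule + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("TYPE TYPES END_TYPE"));
    }
}
```

In this sketch "TYPE" goes to the TYPE rule because it ties with ID and TYPE is defined first, while "TYPES" goes to ID because ID matches one character more.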
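Ideally, you want a Lexer rule that matches everything up to, but not including, END_TYPE.
Unfortunately, in Lexer rules the ~ operator only works on sets, and set members are just single characters, so there is no way to say "everything except END_TYPE". (I tried a predicate, but it included the END_TYP part of END_TYPE in the longer match.)
With this grammar: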
grammar gr;
type : 'TYPE' OTHER* 'END_TYPE';
OTHER: .;
This gives you a parse tree with the TYPE and END_TYPE tokens, but every character between them is a separate token:
[@0,0:3='TYPE',<'TYPE'>,1:0]
[@1,4:4=' ',<OTHER>,1:4]
[@2,5:5='"',<OTHER>,1:5]
[@3,6:6='F',<OTHER>,1:6]
[@4,7:7='r',<OTHER>,1:7]
[@5,8:8='e',<OTHER>,1:8]
[@6,9:9='q',<OTHER>,1:9]
[@7,10:10='u',<OTHER>,1:10]
[@8,11:11='"',<OTHER>,1:11]
[@9,12:12='\n',<OTHER>,1:12]
[@10,13:13='V',<OTHER>,2:0]
[@11,14:14='E',<OTHER>,2:1]
[@12,15:15='R',<OTHER>,2:2]
[@13,16:16='S',<OTHER>,2:3]
[@14,17:17='I',<OTHER>,2:4]
[@15,18:18='O',<OTHER>,2:5]
[@16,19:19='N',<OTHER>,2:6]
[@17,20:20=' ',<OTHER>,2:7]
[@18,21:21=':',<OTHER>,2:8]
[@19,22:22=' ',<OTHER>,2:9]
[@20,23:23='0',<OTHER>,2:10]
[@21,24:24='.',<OTHER>,2:11]
[@22,25:25='1',<OTHER>,2:12]
[@23,26:26='\n',<OTHER>,2:13]
[@24,27:27='/',<OTHER>,3:0]
[@25,28:28='/',<OTHER>,3:1]
[@26,29:29=' ',<OTHER>,3:2]
[@27,30:30='b',<OTHER>,3:3]
[@28,31:31='l',<OTHER>,3:4]
[@29,32:32='a',<OTHER>,3:5]
[@30,33:33='h',<OTHER>,3:6]
[@31,34:34=' ',<OTHER>,3:7]
[@32,35:35='b',<OTHER>,3:8]
[@33,36:36='l',<OTHER>,3:9]
[@34,37:37='a',<OTHER>,3:10]
[@35,38:38='h',<OTHER>,3:11]
[@36,39:39=' ',<OTHER>,3:12]
[@37,40:40='.',<OTHER>,3:13]
[@38,41:41='.',<OTHER>,3:14]
[@39,42:42='.',<OTHER>,3:15]
[@40,43:43='\n',<OTHER>,3:16]
[@41,44:44=' ',<OTHER>,4:0]
[@42,45:45=' ',<OTHER>,4:1]
[@43,46:46=' ',<OTHER>,4:2]
[@44,47:47='I',<OTHER>,4:3]
[@45,48:48='N',<OTHER>,4:4]
[@46,49:49='S',<OTHER>,4:5]
[@47,50:50='I',<OTHER>,4:6]
[@48,51:51='D',<OTHER>,4:7]
[@49,52:52='E',<OTHER>,4:8]
[@50,53:53='\n',<OTHER>,4:9]
[@51,54:54=' ',<OTHER>,5:0]
[@52,55:55=' ',<OTHER>,5:1]
[@53,56:56=' ',<OTHER>,5:2]
[@54,57:57=' ',<OTHER>,5:3]
[@55,58:58=' ',<OTHER>,5:4]
[@56,59:59=' ',<OTHER>,5:5]
[@57,60:60='C',<OTHER>,5:6]
[@58,61:61='l',<OTHER>,5:7]
[@59,62:62='k',<OTHER>,5:8]
[@60,63:63=';',<OTHER>,5:9]
[@61,64:64=' ',<OTHER>,5:10]
[@62,65:65='/',<OTHER>,5:11]
[@63,66:66='/',<OTHER>,5:12]
[@64,67:67=' ',<OTHER>,5:13]
[@65,68:68='C',<OTHER>,5:14]
[@66,69:69='l',<OTHER>,5:15]
[@67,70:70='o',<OTHER>,5:16]
[@68,71:71='c',<OTHER>,5:17]
[@69,72:72='k',<OTHER>,5:18]
[@70,73:73='s',<OTHER>,5:19]
[@71,74:74=' ',<OTHER>,5:20]
[@72,75:75='i',<OTHER>,5:21]
[@73,76:76='n',<OTHER>,5:22]
[@74,77:77='f',<OTHER>,5:23]
[@75,78:78='o',<OTHER>,5:24]
[@76,79:79='r',<OTHER>,5:25]
[@77,80:80='m',<OTHER>,5:26]
[@78,81:81='a',<OTHER>,5:27]
[@79,82:82='t',<OTHER>,5:28]
[@80,83:83='i',<OTHER>,5:29]
[@81,84:84='o',<OTHER>,5:30]
[@82,85:85='n',<OTHER>,5:31]
[@83,86:86='\n',<OTHER>,5:32]
[@84,87:87=' ',<OTHER>,6:0]
[@85,88:88=' ',<OTHER>,6:1]
[@86,89:89=' ',<OTHER>,6:2]
[@87,90:90=' ',<OTHER>,6:3]
[@88,91:91=' ',<OTHER>,6:4]
[@89,92:92=' ',<OTHER>,6:5]
[@90,93:93='I',<OTHER>,6:6]
[@91,94:94='m',<OTHER>,6:7]
[@92,95:95='p',<OTHER>,6:8]
[@93,96:96=';',<OTHER>,6:9]
[@94,97:97=' ',<OTHER>,6:10]
[@95,98:98='/',<OTHER>,6:11]
[@96,99:99='/',<OTHER>,6:12]
[@97,100:100=' ',<OTHER>,6:13]
[@98,101:101='I',<OTHER>,6:14]
[@99,102:102='m',<OTHER>,6:15]
[@100,103:103='p',<OTHER>,6:16]
[@101,104:104='u',<OTHER>,6:17]
[@102,105:105='l',<OTHER>,6:18]
[@103,106:106='s',<OTHER>,6:19]
[@104,107:107='e',<OTHER>,6:20]
[@105,108:108=' ',<OTHER>,6:21]
[@106,109:109='i',<OTHER>,6:22]
[@107,110:110='n',<OTHER>,6:23]
[@108,111:111='f',<OTHER>,6:24]
[@109,112:112='o',<OTHER>,6:25]
[@110,113:113='r',<OTHER>,6:26]
[@111,114:114='m',<OTHER>,6:27]
[@112,115:115='a',<OTHER>,6:28]
[@113,116:116='t',<OTHER>,6:29]
[@114,117:117='i',<OTHER>,6:30]
[@115,118:118='o',<OTHER>,6:31]
[@116,119:119='n',<OTHER>,6:32]
[@117,120:120='\n',<OTHER>,6:33]
[@118,121:121=' ',<OTHER>,7:0]
[@119,122:122=' ',<OTHER>,7:1]
[@120,123:123=' ',<OTHER>,7:2]
[@121,124:124='E',<OTHER>,7:3]
[@122,125:125='N',<OTHER>,7:4]
[@123,126:126='D',<OTHER>,7:5]
[@124,127:127='_',<OTHER>,7:6]
[@125,128:128='I',<OTHER>,7:7]
[@126,129:129='N',<OTHER>,7:8]
[@127,130:130='S',<OTHER>,7:9]
[@128,131:131='I',<OTHER>,7:10]
[@129,132:132='D',<OTHER>,7:11]
[@130,133:133='E',<OTHER>,7:12]
[@131,134:134='\n',<OTHER>,7:13]
[@132,135:135='\n',<OTHER>,8:0]
[@133,136:143='END_TYPE',<'END_TYPE'>,9:0]
[@134,144:143='<EOF>',<EOF>,9:8]
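This is probably quite inefficient, but it's not quite as bad as it looks; you don't have to concatenate all the OTHER tokens yourself. In a Listener, you can do something like this: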
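@Override
public void exitType(SimpleParser.TypeContext ctx) {
    // SimpleParser is the generated parser class; for "grammar gr;" it would be grParser.
    if (ctx.OTHER().isEmpty()) {
        return; // nothing between TYPE and END_TYPE
    }
    // Ask the token stream for the text from the first through the last OTHER token.
    String text = ts.getText(
        ctx.OTHER(0).getSymbol(),
        ctx.OTHER(ctx.OTHER().size() - 1).getSymbol()
    );
    System.out.println(text);
}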
Here, ts is your TokenStream (you need to keep it as a member variable in your Listener and populate it, for example through the constructor).
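To see why the first and last OTHER tokens are enough to identify the text, note that every token in the dump records inclusive start:stop character offsets (4:4 for the first OTHER, 135:135 for the last). The sketch below is plain Java with a hypothetical Tok class; it is not the ANTLR Token interface and makes no claim about how TokenStream.getText is implemented. It only shows that two tokens delimit one contiguous slice of the input.

```java
final class TokenSpanDemo {
    /** Hypothetical stand-in for a token: inclusive start/stop character offsets only. */
    static final class Tok {
        final int start;
        final int stop;

        Tok(int start, int stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    /** Returns the source text covered by first..last, both inclusive. */
    static String textBetween(String input, Tok first, Tok last) {
        return input.substring(first.start, last.stop + 1);
    }

    public static void main(String[] args) {
        String input = "TYPE \"Frequ\"\nEND_TYPE";
        Tok firstOther = new Tok(4, 4);   // the space right after TYPE
        Tok lastOther = new Tok(12, 12);  // the newline right before END_TYPE
        System.out.println(textBetween(input, firstOther, lastOther));
    }
}
```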
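It would probably be more performant/flexible to flesh out your tokenization further (even with a few very simple lexer rules) to cut down on the number of tokens.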
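To get a feel for how much a few simple rules could help, here is a rough sketch that again uses java.util.regex as a stand-in for lexer rules. The three alternatives (a run of word characters, a run of whitespace, any other single character) are hypothetical and only meant to show the reduction in token count compared with one token per character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CoarseTokensDemo {
    // Hypothetical coarse rules: word run, whitespace run, or a single fallback character.
    private static final Pattern COARSE =
            Pattern.compile("[A-Za-z0-9_]+|\\s+|.", Pattern.DOTALL);

    /** Splits the input into coarse tokens that together cover every character. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = COARSE.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "Clk; // Clocks information";
        System.out.println(line.length() + " characters, "
                + tokenize(line).size() + " coarse tokens");
    }
}
```

In a grammar, the equivalent would be a couple of extra lexer rules placed before the OTHER catch-all, so that OTHER is only used for characters nothing else matches.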