ANTLR4: parsing an email header, lookahead not working, Python target
I am trying to parse this part of an email header:
Received: from server.mymailhost.com (mail.mymailhost.com [126.43.75.123]) by pilot01.cl.msu.edu (8.10.2/8.10.2) with ESMTP id NAA23597;Fri, 12 Jul 2002 16:11:20 -0400 (EDT)
I want the lexer to tokenize it into these parts:
Received:
from server.mymailhost.com (mail.mymailhost.com [126.43.75.123])
by pilot01.cl.msu.edu (8.10.2/8.10.2)
with ESMTP
id NAA23597
;
Fri, 12 Jul 2002 16:11:20 -0400 (EDT)
<EOF>
Here is my parser grammar:
parser grammar MyParser;
options { tokenVocab=MyLexer; }
received : Received fromToken byToken withToken idToken SemiColon date EOF ;
fromToken : FromText ;
byToken: ByText ;
withToken : WithText ;
idToken : IdText ;
date : DateContents+ ;
My lexer grammar is further below. These are the errors I get when I run ANTLR:
token recognition error at: 'from server.mymailhost.com (mail.mymailhost.com [126.43.75.123]) by pilot01.cl.msu.edu (8.10.2/8.10.2) with ESMTP id NAA23597;Fri, 12 Jul 2002 16:11:20 -0400 (EDT)'
mismatched input '<EOF>' expecting FromText
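Apparently the lexer successfully picks up the first token (Received: ) but fails on the next one (from ...). Note that I am using lookahead in the lexer grammar; am I using it correctly? Any idea what the problem is?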
lexer grammar MyLexer;
Received : 'Received: ' ;
SemiColon : ';' ;
FromText : 'from ' .+?
{
(self.input.LA(1) == 'b') and (self.input.LA(2) == 'y')
}? ;
ByText : 'by '.+?
{
(self.input.LA(1) == 'w') and (self.input.LA(2) == 'i') and (self.input.LA(3) == 't') and (self.input.LA(4) == 'h')
}? ;
WithText : 'with ' .+?
{
(self.input.LA(1) == 'i') and (self.input.LA(2) == 'd')
}? ;
IdText : 'id ' .+?
{
(self.input.LA(1) == ';')
}? ;
DateContents : ('Mon' | 'Tue' | 'Wed' | 'Thu' | 'Fri' | 'Sat' | 'Sun') (Letter | Number | Special)+ ;
fragment Letter : 'A'..'Z' | 'a'..'z' ;
fragment Number : '0'..'9' ;
fragment Special : ' ' | '_' | '-' | '.' | ',' | '~' | ':' | '+' | '$' | '=' | '(' | ')' | '[' | ']' | '/' ;
Whitespace : [\t\r\n]+ -> skip ;
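After some struggling, I found the answer. Two things had to change in the predicates: the character stream is read through self._input rather than self.input, and each LA(n) result is compared against ord('b') and so on rather than against the string 'b', because LA returns an integer code point. Here is the working lexer: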
lexer grammar MyLexer;
Received : 'Received: ' ;
SemiColon : ';' ;
FromText : 'from ' .+?
{(self._input.LA(1) == ord('b')) and (self._input.LA(2) == ord('y'))}?
;
ByText : 'by '.+?
{(self._input.LA(1) == ord('w')) and (self._input.LA(2) == ord('i')) and (self._input.LA(3) == ord('t')) and (self._input.LA(4) == ord('h'))}?
;
WithText : 'with ' .+?
{(self._input.LA(1) == ord('i')) and (self._input.LA(2) == ord('d'))}?
;
IdText : 'id ' .+?
{(self._input.LA(1) == ord(';'))}?
;
DateContents : ('Mon' | 'Tue' | 'Wed' | 'Thu' | 'Fri' | 'Sat' | 'Sun') (Letter | Number | Special)+ ;
fragment Letter : 'A'..'Z' | 'a'..'z' ;
fragment Number : '0'..'9' ;
fragment Special : ' ' | '_' | '-' | '.' | ',' | '~' | ':' | '+' | '$' | '=' | '(' | ')' | '[' | ']' | '/' ;
Whitespace : [\t\r\n]+ -> skip ;
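To see why the original predicates could never be true, here is a minimal Python sketch. It assumes, as the working grammar suggests, that LA(n) on the Python target returns an integer code point. FakeCharStream and lookahead_is are hypothetical names made up for this illustration; they are not part of the ANTLR runtime.

```python
class FakeCharStream:
    """Hypothetical stand-in for the lexer's character stream (not an ANTLR class)."""

    EOF = -1

    def __init__(self, text, index=0):
        self.text = text
        self.index = index  # position of the next unread character

    def LA(self, offset):
        """Return the code point `offset` characters ahead (1-based), or EOF."""
        pos = self.index + offset - 1
        if pos < 0 or pos >= len(self.text):
            return self.EOF
        return ord(self.text[pos])


def lookahead_is(stream, expected):
    """True if the upcoming characters in `stream` spell out `expected`."""
    return all(stream.LA(i + 1) == ord(ch) for i, ch in enumerate(expected))


header = "from server.mymailhost.com by pilot01.cl.msu.edu"
# Position the stream just before "by", where the FromText predicate should fire.
stream = FakeCharStream(header, index=header.index("by"))

# Comparing an int to a one-character string is always False in Python 3.
broken = stream.LA(1) == 'b' and stream.LA(2) == 'y'
# Comparing against ord(...) checks the actual code points.
fixed = stream.LA(1) == ord('b') and stream.LA(2) == ord('y')

print(broken, fixed, lookahead_is(stream, "by"))  # False True True
```

A helper along the lines of lookahead_is might also shorten the longer predicates (the four-character check for "with", for example), but I have not tried wiring it into the grammar.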