ANTLR 4 grammar gives extraneous input error
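I am trying to create a (what I thought would be) simple grammar to process a file containing a list of key/value assignments, one assignment per line.
I used ANTLR in the past (mid-1990s) and decided to use it again, because I want to allow comments in the assignment file as well as Unicode keys and values.
My simple test file shows, once again, that writing a correct grammar is hard even with good tools. I am using the ANTLR Language Support Plug-in for VS 2012 and developing in C#. So I am well off the Eclipse/Java reservation, but the C# plugin and the ANTLR NuGet packages (runtime and code generator) work exactly as advertised.
My grammar file is: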
grammar AssignmentListFile;
/*
* See: http://en.wikipedia.org/wiki/List_of_Unicode_characters
* for list of Unicode Code Points
*/
/*
* Lexer Rules: Must be in all UPPER case
* Parser Rules: Must be in all lower case
*/
// Ignore All non-printable control characters except: CR, LF and SPACE
IGNORED_WHITESPACE :
(
'\u0000' .. '\u0009' // 7-bit control chars less than Line Feed
| '\u000B' | '\u000C' // Vertical tab and Form feed
| '\u000E' .. '\u001F' // 7-bit control chars more than Carriage Return
| '\u007F' .. '\u009F' // 8-bit ASCII control characters and DEL
)+
-> channel(HIDDEN)
;
// Ignore Comments and any ending white spaces
JAVADOC_COMMENT
: '/**' .*? '*/' [ \r\n]*
-> channel(HIDDEN)
;
CSTYLE_COMMENT
: '/*' .*? '*/' [ \r\n]*
-> channel(HIDDEN)
;
/*
* Manage the assignment delimiter and
* the 3 white space characters which have not been ignored: SPACE, CR, and LF
*/
fragment SINGLE_SPACE : ' ';
EQUALS : '=';
EOL : SINGLE_SPACE* [\r\n]+ SINGLE_SPACE* ;
ASSIGNMENT_OPERATOR : SINGLE_SPACE* EQUALS SINGLE_SPACE* ;
// define the various forms of single and double quotes for the dumb, open, and close variants
// ASCII Open/Left Close/Right
CHAR_SINGLEQUOTE : ('\u0027' | '\u2018' | '\u2019') ;
CHAR_DOUBLEQUOTE : ('\u0022' | '\u201C' | '\u201D') ;
/*
* create the character sets that can be part of an ID
*/
fragment IDCHAR_COMMON :
( '\u0020' | '\u0021' // Space and bang (!)
| '\u0023' .. '\u0026' // # to & (skips ")
| '\u0028' .. '\u003C' // ( to < (skips ')
| '\u003E' .. '\u007E' // > to ~ (skips =)
| '\u00A0' .. '\u2018' // printable UNICODE code points below Open Single Quote
| '\u201A' .. '\u201B' // printable UNICODE code points between Close Single Quote and Open Double Quote
| '\u201E' .. '\uFFFF' // printable UNICODE code points above Close Double Quote
)
;
// define the characters that can be contained in each of the quoted identifier types
NON_QUOTED_VALUE : IDCHAR_COMMON+;
DOUBLE_QUOTED_VALUE : NON_QUOTED_VALUE
| (IDCHAR_COMMON | CHAR_SINGLEQUOTE | EQUALS)+
;
SINGLE_QUOTED_VALUE : NON_QUOTED_VALUE
| (IDCHAR_COMMON | CHAR_DOUBLEQUOTE | EQUALS)+
;
file : file_line* EOF ;
file_line
: assignment
| EOL
;
assignment
: identifier ASSIGNMENT_OPERATOR identifier
;
identifier
: NON_QUOTED_VALUE
| CHAR_DOUBLEQUOTE DOUBLE_QUOTED_VALUE CHAR_DOUBLEQUOTE
| CHAR_SINGLEQUOTE SINGLE_QUOTED_VALUE CHAR_SINGLEQUOTE
;
My input file is:
/*
* This is a Multiline C-Style comment
* with white space here:
*/
/* this is a single line C-Style comment */
/* this is a single line C-Style comment /w whitepace */
/*
*/
/**/
/**
* this is a Multiline JavaDoc comment
* with white space here:
*/
/** this is a single line JavaDoc comment */
/**
*/
/***/
JOHN=WASHBURN
JOHN = WASHBURN
'JOHN'='WASHBURN'
"JOHN" = "WASHBURN"
The C# code that invokes the lexer and parser is:
var input = new AntlrInputStream(textStream.ReadToEnd());
var lexer = new AssignmentListFileLexer(input);
var tokens = new CommonTokenStream(lexer);
var parser = new AssignmentListFileParser(tokens);
Console.WriteLine("\n");
IParseTree tree = parser.file();
Console.WriteLine(tree.ToStringTree(parser));
Console.WriteLine("\n");
When this C# code is run against the test file under NUnit, the result is:
line 23:0 extraneous input 'JOHN=WASHBURN' expecting {<EOF>, EOL, CHAR_SINGLEQUOTE, CHAR_DOUBLEQUOTE, NON_QUOTED_VALUE}
line 24:1 extraneous input 'JOHN = WASHBURN ' expecting {<EOF>, EOL, CHAR_SINGLEQUOTE, CHAR_DOUBLEQUOTE, NON_QUOTED_VALUE}
line 25:0 extraneous input ''JOHN'='WASHBURN'' expecting {<EOF>, EOL, CHAR_SINGLEQUOTE, CHAR_DOUBLEQUOTE, NON_QUOTED_VALUE}
line 26:0 extraneous input '"JOHN" = "WASHBURN"' expecting {<EOF>, EOL, CHAR_SINGLEQUOTE, CHAR_DOUBLEQUOTE, NON_QUOTED_VALUE}
(file JOHN=WASHBURN (file_line \r\n ) JOHN = WASHBURN (file_line \r\n) 'JOHN'='WASHBURN' (file_line \r\n) "JOHN" = "WASHBURN" <EOF>)
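First, you can see I have not even started testing the interesting cases (for example German names and values, or quoted IDs containing = signs or other quote characters). All the test files containing only ignorable white space and/or comments parse as expected. The printed tree shows that the end-of-line (EOL) logic seems to be on track. The recognition errors occur in the parsing of the assignment expressions themselves.
I am stumped as to how the 4-character phrase JOHN (or the phrase WASHBURN) can fail to match NON_QUOTED_VALUE, how 'JOHN' can fail to match CHAR_SINGLEQUOTE, or how '=' or ' = ' can fail to match the assignment rule.
I am sure this will be a DOH!! moment, but what am I missing here?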
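The 4-character phrase JOHN is not recognized as a NON_QUOTED_VALUE token because the whole of JOHN=WASHBURN is recognized as a single DOUBLE_QUOTED_VALUE token. Instrumenting your grammar with the following tracing actions shows this (sorry, the actions are Java code, but I am sure you can translate them).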
NON_QUOTED_VALUE : IDCHAR_COMMON+ {System.out.println("#A:"+getText());};
DOUBLE_QUOTED_VALUE : NON_QUOTED_VALUE
| (IDCHAR_COMMON | CHAR_SINGLEQUOTE | EQUALS)+ {System.out.println("#B:"+getText());}
;
SINGLE_QUOTED_VALUE : NON_QUOTED_VALUE
| (IDCHAR_COMMON | CHAR_DOUBLEQUOTE | EQUALS)+ {System.out.println("#C:"+getText());}
;
... which produces the following output ...
#B:JOHN=WASHBURN
#B:JOHN = WASHBURN
#B:'JOHN'='WASHBURN'
#C:"JOHN" = "WASHBURN"
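The reason is that the lexer rule that recognizes the longest match takes precedence. The lexer tokenizes the input without regard to what the parser expects at that point, and the order of the rules only breaks ties between matches of equal length.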
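To make the longest-match behaviour concrete, here is a minimal Python sketch that imitates the lexer's choice for a subset of the question's token rules. It is not ANTLR: the `RULES` table, the `tokenize` helper, and the use of Python regular expressions are assumptions made for illustration, and the comment and control-character rules are left out.

```python
import re

# Character sets copied from the question's grammar (as regex class bodies).
SQ = r"\u0027\u2018\u2019"          # CHAR_SINGLEQUOTE
DQ = r"\u0022\u201C\u201D"          # CHAR_DOUBLEQUOTE
IDCHAR = (r"\u0020\u0021\u0023-\u0026\u0028-\u003C\u003E-\u007E"
          r"\u00A0-\u2018\u201A-\u201B\u201E-\uFFFF")  # IDCHAR_COMMON

# Token rules in the order the question's grammar declares them.
RULES = [(name, re.compile(pattern)) for name, pattern in [
    ("EQUALS", "="),
    ("EOL", r" *[\r\n]+ *"),
    ("ASSIGNMENT_OPERATOR", " *= *"),
    ("CHAR_SINGLEQUOTE", f"[{SQ}]"),
    ("CHAR_DOUBLEQUOTE", f"[{DQ}]"),
    ("NON_QUOTED_VALUE", f"[{IDCHAR}]+"),
    ("DOUBLE_QUOTED_VALUE", f"[{IDCHAR}{SQ}=]+"),
    ("SINGLE_QUOTED_VALUE", f"[{IDCHAR}{DQ}=]+"),
]]

def tokenize(text):
    """Split text into (rule name, lexeme) pairs using longest match."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer wins; on a tie the earlier rule is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

if __name__ == "__main__":
    for line in ["JOHN=WASHBURN", "JOHN = WASHBURN",
                 "'JOHN'='WASHBURN'", '"JOHN" = "WASHBURN"']:
        print(tokenize(line))
```

In this sketch each of the four sample lines comes back as one token: the first three as DOUBLE_QUOTED_VALUE and the last as SINGLE_QUOTED_VALUE, which lines up with the #B and #C trace output above. NON_QUOTED_VALUE does match JOHN, but only four characters of it, so it loses to the rule that can consume the whole line.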
If it helps, the following grammar should recognize your sample file.
CHAR_SINGLEQUOTE : ('\u0027' | '\u2018' | '\u2019') ;
CHAR_DOUBLEQUOTE : ('\u0022' | '\u201C' | '\u201D') ;
EQUALS : '=';
EOL : [\r\n]+ ;
IGNORED_WHITESPACE :
( ' '
| '\u0000' .. '\u0009' // 7-bit control chars less than Line Feed
| '\u000B' | '\u000C' // Vertical tab and Form feed
| '\u000E' .. '\u001F' // 7-bit control chars more than Carriage Return
| '\u007F' .. '\u009F' // 8-bit ASCII control characters and DEL
)+
-> channel(HIDDEN)
;
fragment IDCHAR_COMMON :
( '\u0020' | '\u0021' // Space and bang (!)
| '\u0023' .. '\u0026' // # to & (skips ")
| '\u0028' .. '\u003C' // ( to < (skips ')
| '\u003E' .. '\u007E' // > to ~ (skips =)
| '\u00A0' .. '\u2018' // printable UNICODE code points below Open Single Quote
| '\u201A' .. '\u201B' // printable UNICODE code points between Close Single Quote and Open Double Quote
| '\u201E' .. '\uFFFF' // printable UNICODE code points above Close Double Quote
)
;
NON_QUOTED_VALUE : IDCHAR_COMMON+ ;
JAVADOC_COMMENT
: '/**' .*? '*/' [ \r\n]*
-> channel(HIDDEN)
;
CSTYLE_COMMENT
: '/*' .*? '*/' [ \r\n]*
-> channel(HIDDEN)
;
file : file_line* EOF ;
file_line
: assignment
| EOL
;
assignment
: identifier EQUALS identifier
;
identifier : NON_QUOTED_VALUE
| CHAR_DOUBLEQUOTE (NON_QUOTED_VALUE | CHAR_SINGLEQUOTE | EQUALS)+ CHAR_DOUBLEQUOTE
| CHAR_SINGLEQUOTE (NON_QUOTED_VALUE | CHAR_DOUBLEQUOTE | EQUALS)+ CHAR_SINGLEQUOTE ;
This should also parse the following lines, which, from reading your grammar, I assume you consider valid.
'JO"HN'='WASHBURN'
"JO='HN" = "WASHBURN"
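As a rough check of the revised grammar, the Python sketch below imitates its lexer and then tests the resulting token types against the `assignment` rule. It is an approximation, not ANTLR: `visible_tokens`, `is_assignment`, and the one-letter token codes are hypothetical helpers, the comment rules are omitted, and IDCHAR_COMMON is assumed to be a fragment (a helper that never becomes a token on its own), as it was in the question's grammar.

```python
import re

SQ = r"\u0027\u2018\u2019"          # CHAR_SINGLEQUOTE
DQ = r"\u0022\u201C\u201D"          # CHAR_DOUBLEQUOTE
IDCHAR = (r"\u0020\u0021\u0023-\u0026\u0028-\u003C\u003E-\u007E"
          r"\u00A0-\u2018\u201A-\u201B\u201E-\uFFFF")  # IDCHAR_COMMON
CTRL = r"\u0000-\u0009\u000B\u000C\u000E-\u001F\u007F-\u009F"

# (name, pattern, hidden) in the order the revised grammar declares them.
RULES = [(name, re.compile(pattern), hidden) for name, pattern, hidden in [
    ("CHAR_SINGLEQUOTE", f"[{SQ}]", False),
    ("CHAR_DOUBLEQUOTE", f"[{DQ}]", False),
    ("EQUALS", "=", False),
    ("EOL", r"[\r\n]+", False),
    ("IGNORED_WHITESPACE", f"[ {CTRL}]+", True),
    ("NON_QUOTED_VALUE", f"[{IDCHAR}]+", False),
]]

def visible_tokens(text):
    """Longest-match tokenization; hidden-channel tokens are dropped."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern, hidden in RULES:
            m = pattern.match(text, pos)
            # Strictly longer wins; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end(), hidden)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, end, hidden = best
        if not hidden:
            tokens.append((name, text[pos:end]))
        pos = end
    return tokens

# One letter per token type, so the parser rule can be written as a regex.
CODE = {"NON_QUOTED_VALUE": "N", "CHAR_SINGLEQUOTE": "S",
        "CHAR_DOUBLEQUOTE": "D", "EQUALS": "E", "EOL": "L"}
IDENTIFIER = "(?:N|D[NSE]+D|S[NDE]+S)"   # mirrors the identifier rule
ASSIGNMENT = re.compile(f"{IDENTIFIER}E{IDENTIFIER}")

def is_assignment(line):
    """True if the line's token types fit: identifier EQUALS identifier."""
    kinds = "".join(CODE[name] for name, _ in visible_tokens(line))
    return ASSIGNMENT.fullmatch(kinds) is not None

if __name__ == "__main__":
    for line in ["JOHN=WASHBURN", "JOHN = WASHBURN", "'JO\"HN'='WASHBURN'",
                 "\"JO='HN\" = \"WASHBURN\""]:
        print(is_assignment(line), visible_tokens(line))
```

One thing the sketch suggests: because IDCHAR_COMMON still contains the space character, `JOHN = WASHBURN` lexes as `JOHN ` and ` WASHBURN` with the spaces attached to the values. A space is hidden only where it stands alone between other tokens, as in the quoted examples, so unquoted keys and values may need trimming after the parse.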