ANTLR4: How to match extra spaces at the beginning of a line?
I am trying to match extra spaces at the beginning of a line, but without success. How should I modify the lexer rules so that they match?
TestParser.g4:
parser grammar TestParser;
options { tokenVocab=TestLexer; }
root
    : choice+ EOF
    ;
choice
    : QUESTION OPTION+
    ;
TestLexer.g4:
lexer grammar TestLexer;
@lexer::members {
    private boolean aheadIsNotAnOption(IntStream _input) {
        int nextChar = _input.LA(1);
        return nextChar != 'A' && nextChar != 'B' && nextChar != 'C' && nextChar != 'D';
    }
}
QUESTION: {getCharPositionInLine() == 0}? DIGIT DOT CONTENT -> pushMode(OPTION_MODE);
OTHER: . -> skip;
mode OPTION_MODE;
OPTION: OPTION_HEADER DOT CONTENT;
NOT_OPTION_LINE: NEWLINE SPACE* {aheadIsNotAnOption(_input)}? -> popMode, skip;
OPTION_OTHER: OTHER -> skip;
fragment DIGIT: [0-9]+;
fragment OPTION_HEADER: [A-D];
fragment CONTENT: [a-zA-Z0-9 ,.'?/()!]+? {_input.LA(1) == '\n'}?;
fragment DOT: '.';
fragment NEWLINE: '\n';
fragment SPACE: ' ';
Input text:
1.title
A.aaa
B.bbb
 C.ccc
2.title
A.aaa
Java code:
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.ParseTree;
import java.io.IOException;
import java.net.URISyntaxException;
public class TestParseTest {
    public static void main(String[] args) throws URISyntaxException, IOException {
        CharStream charStream = CharStreams.fromString("1.title\n" +
                "A.aaa\n" +
                "B.bbb\n" +
                " C.ccc\n" +
                "2.title\n" +
                "A.aaa\n");
        Lexer lexer = new TestLexer(charStream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree parseTree = parser.root();
        System.out.println(parseTree.toStringTree(parser));
    }
}
The output is:
(root (choice 1.title A.aaa B.bbb) (choice 2.title A.aaa) <EOF>)
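The idea is that the mode is popped when a non-option line is encountered in OPTION_MODE. Now, when a line starts with an extra space, the result is not what I expect. It looks like the \n before C.ccc is matched by NOT_OPTION_LINE, which causes the mode to be popped? I would like C.ccc to be matched as an OPTION. Thanks.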
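I think you are making this too complicated. As I see it, a line either starts with a question ([ \t]* [0-9]+) or with an option ([ \t]* [A-Z]). In all other cases, just ignore the line (. -> skip). That comes down to the following grammar: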
lexer grammar TestLexer;
QuestionStart
    : {getCharPositionInLine() == 0}? [ \t]* [0-9]+ '.' -> pushMode(ContentMode)
    ;
OptionStart
    : {getCharPositionInLine() == 0}? [ \t]* [A-Z] '.' -> pushMode(ContentMode)
    ;
Ignored
    : . -> skip
    ;
mode ContentMode;
Content
    : ~[\r\n]+
    ;
QuestionEnd
    : [\r\n]+ -> skip, popMode
    ;
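To make the line classification easier to see outside of ANTLR, here is a minimal plain-Java sketch of the same decision. It is only an approximation of the lexer rules above, using regular expressions instead of the generated lexer; the class name LineKindSketch, the enum Kind and the method classify are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

// Plain-Java approximation of the three default-mode lexer rules.
// No ANTLR runtime is involved; all names here are hypothetical.
final class LineKindSketch {
    enum Kind { QUESTION, OPTION, IGNORED }

    // Mirrors QuestionStart: optional spaces/tabs, one or more digits, a dot.
    private static final Pattern QUESTION_START =
            Pattern.compile("^[ \\t]*[0-9]+\\.");

    // Mirrors OptionStart: optional spaces/tabs, one letter A-Z, a dot.
    private static final Pattern OPTION_START =
            Pattern.compile("^[ \\t]*[A-Z]\\.");

    // Classifies a single line (without its line terminator).
    static Kind classify(String line) {
        if (QUESTION_START.matcher(line).find()) {
            return Kind.QUESTION;
        }
        if (OPTION_START.matcher(line).find()) {
            return Kind.OPTION;
        }
        // Corresponds to the Ignored rule, which skips everything else.
        return Kind.IGNORED;
    }

    public static void main(String[] args) {
        String[] lines = { "1.title", "A.aaa", " C.ccc", " ...ignored ..." };
        for (String line : lines) {
            System.out.println(classify(line) + " <- \"" + line + "\"");
        }
    }
}
```

The point of the sketch is that the leading [ \t]* is part of the question/option pattern itself, so an indented line needs no special handling.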
The parser grammar could then look like this:
parser grammar TestParser;
options { tokenVocab=TestLexer; }
root
    : question+ EOF
    ;
question
    : QuestionStart Content option+
    ;
option
    : OptionStart Content+
    ;
And the Java code:
String source = "1.title\n" +
        "A.aaa\n" +
        "B.bbb\n" +
        " C.ccc\n" +
        " ...ignored ...\n" +
        "2.title\n" +
        "A.aaa\n";
Lexer lexer = new TestLexer(CharStreams.fromString(source));
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokens);
ParseTree parseTree = parser.root();
System.out.println(parseTree.toStringTree(parser));
will then print:
(root (question 1. title (option A. aaa) (option B. bbb) (option C. ccc)) (question 2. title (option A. aaa)) <EOF>)
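The tree above nests each option under the question that precedes it. As a rough illustration of that grouping without the ANTLR runtime, here is a plain-Java sketch. It is not the generated parser: QuizGroupingSketch and group are hypothetical names, the patterns are regex approximations of the lexer rules, and an option that appears before any question is silently dropped here, whereas the real parser would report a syntax error.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java approximation of "question : QuestionStart Content option+".
// All names are hypothetical; no ANTLR classes are used.
final class QuizGroupingSketch {
    // Leading spaces/tabs are matched but kept out of the captured groups.
    private static final Pattern QUESTION =
            Pattern.compile("^[ \\t]*([0-9]+\\.)(.+)$");
    private static final Pattern OPTION =
            Pattern.compile("^[ \\t]*([A-Z]\\.)(.+)$");

    // Returns question text -> option texts, both in input order.
    static Map<String, List<String>> group(String source) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> currentOptions = null;
        for (String line : source.split("\\r?\\n")) {
            Matcher question = QUESTION.matcher(line);
            Matcher option = OPTION.matcher(line);
            if (question.matches()) {
                // A new question starts a new option list.
                currentOptions = new ArrayList<>();
                result.put(question.group(1) + question.group(2), currentOptions);
            } else if (option.matches() && currentOptions != null) {
                currentOptions.add(option.group(1) + option.group(2));
            }
            // Every other line is ignored, like the Ignored lexer rule.
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "1.title\nA.aaa\nB.bbb\n C.ccc\n ...ignored ...\n2.title\nA.aaa\n";
        System.out.println(group(source));
    }
}
```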
EDIT
Given that you already have target-specific code in your grammar, you could trim the whitespace from the option like this (untested!):
OptionStart
    : {getCharPositionInLine() == 0}? [ \t]* [A-Z] '.'
      {setText(getText().trim());}
      -> pushMode(ContentMode)
    ;
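For reference, the action above relies only on the standard String.trim() call. A small stand-alone sketch of what it does to the kind of text OptionStart would match; TrimSketch and trimmedTokenText are hypothetical names, and the sample strings are assumed token texts rather than output captured from a real lexer run.

```java
// Stand-alone illustration of the trim() call used in the lexer action.
// The class and method names are hypothetical.
final class TrimSketch {
    // Same operation as setText(getText().trim()) applies to the token text:
    // leading and trailing spaces and tabs are removed.
    static String trimmedTokenText(String rawTokenText) {
        return rawTokenText.trim();
    }

    public static void main(String[] args) {
        // Assumed raw texts for an indented and a non-indented option start.
        String[] rawTexts = { " C.", "\tA.", "B." };
        for (String raw : rawTexts) {
            System.out.println("[" + raw + "] -> [" + trimmedTokenText(raw) + "]");
        }
    }
}
```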