Java - matcher re-reading words
I am trying to write a lexical analyzer for Delphi in Java. Here is the sample code:
String[] keywords={"array","as","asm","begin","case","class","const","constructor","destructor","dispinterface","div","do","downto","else","end","except","exports","file","finalization","finally","for","function","goto","if","implementation","inherited","initialization","inline","interface","is","label","library","mod","nil","object","of","out","packed","procedure","program","property","raise","record","repeat","resourcestring","set","shl","shr","string","then","threadvar","to","try","type","unit","until","uses","var","while","with"};
String[] relation={"=","<>","<",">","<=",">="};
String[] logical={"and","not","or","xor"};
Matcher matcher = null;
for (int i = 0; i < keywords.length; i++) {
    matcher = Pattern.compile(keywords[i]).matcher(line);
    if (matcher.find()) {
        System.out.println("Keyword" + "\t\t" + matcher.group());
    }
}
for (int i1 = 0; i1 < logical.length; i1++) {
    matcher = Pattern.compile(logical[i1]).matcher(line);
    if (matcher.find()) {
        System.out.println("logic_op" + "\t\t" + matcher.group());
    }
}
for (int i2 = 0; i2 < relation.length; i2++) {
    matcher = Pattern.compile(relation[i2]).matcher(line);
    if (matcher.find()) {
        System.out.println("relational_op" + "\t\t" + matcher.group());
    }
}
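When I run the program it works, but some words are read twice and reported as two tokens. For example, record is a keyword, but the program scans it again and also reports the logical operator or, found inside rec"or"d. How can I stop words from being re-read? Thanks!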
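Add \b to your regular expressions so that they only match at word boundaries. So:
Pattern.compile("\\b" + keywords[i] + "\\b")
ensures that the characters on either side of the word are not word characters (letters, digits, or underscore). The backslash has to be doubled in a Java string literal, because "\b" on its own is a backspace character.
That way "record" will only match "record," and not "or."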
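As a minimal sketch of the difference, the following compares the two patterns on one sample line. The class name WordBoundaryDemo, the helper countMatches, and the sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Counts how many times the regex matches anywhere in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "record x or y"; // hypothetical sample input
        // Without \b, "or" is found inside "record" as well as on its own.
        System.out.println(countMatches("or", line));       // prints 2
        // With \b, only the standalone "or" is found.
        System.out.println(countMatches("\\bor\\b", line)); // prints 1
    }
}
```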
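As stated, you need to add the \b word-boundary matcher before and after each keyword, to prevent matching substrings inside words.
For better performance, you should also use the | regex alternation operator to match one of several values, rather than creating multiple matchers, so that line is scanned only once and only one regex has to be compiled.
You can even combine the three kinds of token you're looking for into a single regex and use capture groups to tell them apart, so that line is scanned just once in total.
Like this: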
// Requires java.util.regex.Matcher and java.util.regex.Pattern.
// Group 1: keywords, group 2: relational operators, group 3: logical operators.
String regex = "\\b(array|as|asm|begin|case|class|const|constructor|destructor|dispinterface|div|do|downto|else|end|except|exports|file|finalization|finally|for|function|goto|if|implementation|inherited|initialization|inline|interface|is|label|library|mod|nil|object|of|out|packed|procedure|program|property|raise|record|repeat|resourcestring|set|shl|shr|string|then|threadvar|to|try|type|unit|until|uses|var|while|with)\\b" +
               "|(=|<[>=]?|>=?)" +
               "|\\b(and|not|or|xor)\\b";
for (Matcher m = Pattern.compile(regex).matcher(line); m.find(); ) {
    // start(n) returns -1 when group n did not take part in the match.
    if (m.start(1) != -1) {
        System.out.println("Keyword\t\t" + m.group(1));
    } else if (m.start(2) != -1) {
        System.out.println("relational_op\t\t" + m.group(2));
    } else {
        System.out.println("logic_op\t\t" + m.group(3));
    }
}
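You could optimize it further by merging keywords that share a common prefix, e.g. as|asm can become asm?, i.e. as optionally followed by m. That makes the keyword list less readable, but it should perform better.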
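A small sketch of that prefix merging, using a made-up subset of the keyword list (as, asm, do, downto) rather than the full one. The class name, the tokens helper, and the sample line are hypothetical. Both patterns should report the same tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrefixMergeDemo {
    // Plain alternation: each keyword is listed separately.
    static final Pattern PLAIN = Pattern.compile("\\b(as|asm|do|downto)\\b");
    // Merged form: the shared prefix is written once, the remainder is optional.
    static final Pattern MERGED = Pattern.compile("\\b(asm?|do(?:wnto)?)\\b");

    // Collects every keyword the pattern finds in the input, in order.
    static List<String> tokens(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "asm do x as y downto z"; // hypothetical sample input
        System.out.println(tokens(PLAIN, line));  // prints [asm, do, as, downto]
        System.out.println(tokens(MERGED, line)); // prints [asm, do, as, downto]
    }
}
```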
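In the code above I did this for the relational operators, to show how it's done, and also to fix a matching bug in the original code, where a >= in line would be reported three times, as =, >, and >=, in that order. That is similar to the sub-keyword problem asked about in the question.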
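To illustrate the >= case, here is a sketch that runs the one-pattern-per-operator approach and the combined pattern on the same input. The class name, the two helper methods, and the sample line are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RelationalOpDemo {
    // Approach from the question: one pattern per operator, each checked with find().
    static List<String> separatePatterns(String line) {
        String[] relation = {"=", "<>", "<", ">", "<=", ">="};
        List<String> found = new ArrayList<>();
        for (String op : relation) {
            // Pattern.quote makes sure the operator is treated as literal text.
            Matcher m = Pattern.compile(Pattern.quote(op)).matcher(line);
            if (m.find()) {
                found.add(m.group());
            }
        }
        return found;
    }

    // Combined approach: a single pattern consumes the whole operator in one match.
    static List<String> combinedPattern(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("=|<[>=]?|>=?").matcher(line);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "if a >= b then"; // hypothetical sample input
        System.out.println(separatePatterns(line)); // prints [=, >, >=]
        System.out.println(combinedPattern(line));  // prints [>=]
    }
}
```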