JavaCC and Unicode issue: why can't \u696d be handled in JavaCC although it belongs to the range "\u4e00"-"\u9fff"?
We are trying to use JavaCC as a parser for source code encoded in UTF-8 (the language is Japanese). In JavaCC we have a declaration like this:
< #LETTER:
[
"\u0024",
"\u0041"-"\u005a",
"\u005f",
"\u0061"-"\u007a",
"\u00c0"-"\u00d6",
"\u00d8"-"\u00f6",
"\u00f8"-"\u00ff",
"\u0100"-"\u1fff",
"\u3040"-"\u318f",
"\u3300"-"\u337f",
"\u3400"-"\u3d2d",
"\u4e00"-"\u9fff",
"\uf900"-"\ufaff"
]
>
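If the parser encounters a string such as "日建フェンス工業", it fails because of the 業 character. If I remove that character, parsing works as expected. The code for 業 is "\u696d" which, as the declaration shows, should fall within the range "\u4e00"-"\u9fff".
Any suggestions?
PS: If we rewrote this grammar in ANTLR, what would it look like?
Thanks a lot.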
In an ANTLR grammar it would look very similar. Here is a lexer fragment (from my MySQL grammar):
// As defined in http://dev.mysql.com/doc/refman/5.6/en/identifiers.html.
fragment LETTER_WHEN_UNQUOTED:
'0'..'9'
| 'A'..'Z' // Only upper case, as we use a case insensitive parser (insensitive only for ASCII).
| '$'
| '_'
| '\u0080'..'\uffff'
;
Note that ANTLR does not handle input beyond the BMP.
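To make "beyond the BMP" concrete in Java terms, here is a small sketch. The class name `BmpDemo` and the helper `isBmpOnly` are made up for illustration and are not part of ANTLR or JavaCC. It shows that 業 (U+696D) is a single UTF-16 code unit inside the BMP, while a supplementary character such as U+20000 is stored as a surrogate pair, so a range like '\u0080'..'\uffff' can only ever see its two halves.

```java
final class BmpDemo {
    // True when every code point of s lies in the Basic Multilingual Plane (U+0000 to U+FFFF).
    static boolean isBmpOnly(String s) {
        return s.codePoints().allMatch(Character::isBmpCodePoint);
    }

    public static void main(String[] args) {
        String bmp = "\u696d";                                          // U+696D, inside the BMP
        String supplementary = new String(Character.toChars(0x20000)); // U+20000, outside the BMP

        System.out.println(bmp.length());              // 1 UTF-16 code unit
        System.out.println(supplementary.length());    // 2 UTF-16 code units (a surrogate pair)
        System.out.println(isBmpOnly(bmp));            // true
        System.out.println(isBmpOnly(supplementary));  // false
    }
}
```

Every character in the string from the question lies inside the BMP, so this limitation should not matter for that particular input.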
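There is nothing wrong with your token fragment, and nothing wrong with JavaCC. The problem lies elsewhere.
Here is a JavaCC specification made by copying the code from your question and pasting it into a JavaCC grammar.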
options {
  static = true;
  debug_token_manager = true;
}

PARSER_BEGIN(MyNewGrammar)
package funnyunicode;

import java.io.StringReader;

public class MyNewGrammar
{
  public static void main(String args []) throws ParseException
  {
    // With static = true, constructing the parser initializes the static state used by go().
    MyNewGrammar parser = new MyNewGrammar(new StringReader("日建フェンス工業"));
    MyNewGrammar.go();
    System.out.println("OK.");
  }
}
PARSER_END(MyNewGrammar)
TOKEN :
{
< WORD : (<LETTER>)+ >
|
< #LETTER:
[
"\u0024",
"\u0041"-"\u005a",
"\u005f",
"\u0061"-"\u007a",
"\u00c0"-"\u00d6",
"\u00d8"-"\u00f6",
"\u00f8"-"\u00ff",
"\u0100"-"\u1fff",
"\u3040"-"\u318f",
"\u3300"-"\u337f",
"\u3400"-"\u3d2d",
"\u4e00"-"\u9fff",
"\uf900"-"\ufaff"
] >
}
void go() :
{Token tk ; }
{
tk=<WORD> <EOF>
}
Here is the output of the generated Java program:
Current character : \u65e5 (26085) at line 1 column 1
Starting NFA to match one of : { <WORD> }
Current character : \u65e5 (26085) at line 1 column 1
Currently matched the first 1 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u5efa (24314) at line 1 column 2
Currently matched the first 2 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u30d5 (12501) at line 1 column 3
Currently matched the first 3 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u30a7 (12455) at line 1 column 4
Currently matched the first 4 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u30f3 (12531) at line 1 column 5
Currently matched the first 5 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u30b9 (12473) at line 1 column 6
Currently matched the first 6 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u5de5 (24037) at line 1 column 7
Currently matched the first 7 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
Current character : \u696d (26989) at line 1 column 8
Currently matched the first 8 characters as a <WORD> token.
Possible kinds of longer matches : { <WORD> }
****** FOUND A <WORD> MATCH (\u65e5\u5efa\u30d5\u30a7\u30f3\u30b9\u5de5\u696d) ******
Returning the <EOF> token.
OK.
As you can see, the generated tokenizer has no trouble treating \u696d as a LETTER.
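As for where "elsewhere" might be: the question does not show how the input reaches the parser, so what follows is only a guess at one class of problem, not a diagnosis. The sketch below copies the LETTER ranges into plain Java (the class `LetterRangeCheck` and its methods are hypothetical names, not JavaCC API) and checks the same UTF-8 bytes decoded once with the right charset and once with the wrong one.

```java
import java.nio.charset.StandardCharsets;

final class LetterRangeCheck {
    // Inclusive ranges copied from the LETTER declaration in the question.
    private static final int[][] RANGES = {
        {0x0024, 0x0024}, {0x0041, 0x005a}, {0x005f, 0x005f}, {0x0061, 0x007a},
        {0x00c0, 0x00d6}, {0x00d8, 0x00f6}, {0x00f8, 0x00ff}, {0x0100, 0x1fff},
        {0x3040, 0x318f}, {0x3300, 0x337f}, {0x3400, 0x3d2d}, {0x4e00, 0x9fff},
        {0xf900, 0xfaff}
    };

    // True when c falls in one of the LETTER ranges.
    static boolean isLetter(char c) {
        for (int[] r : RANGES) {
            if (c >= r[0] && c <= r[1]) {
                return true;
            }
        }
        return false;
    }

    // True when every UTF-16 code unit of s is a LETTER.
    static boolean allLetters(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isLetter(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The string from the question, written with escapes so the source file encoding does not matter.
        String text = "\u65e5\u5efa\u30d5\u30a7\u30f3\u30b9\u5de5\u696d";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        String decodedAsUtf8 = new String(utf8, StandardCharsets.UTF_8);
        String decodedAsLatin1 = new String(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(allLetters(decodedAsUtf8));   // true
        System.out.println(allLetters(decodedAsLatin1)); // false
    }
}
```

Note that a wrong charset like this one would already break on the first character, so it does not by itself explain why only 業 fails. The point is just that the ranges accept the whole string once it has been decoded correctly, so it is worth checking what characters the token manager actually receives (for example with debug_token_manager, as above).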