StreamTokenizer mangles integers and loose periods
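I have appropriated and adapted the code below, which does a nice job of tokenizing Java code using Java's StreamTokenizer. Its number handling is problematic, though:
- It turns all integers into doubles. I can get around that by testing num % 1 == 0, but this feels like a hack.
- More critically, a . that follows whitespace is treated as a number. "Class .method()" is legal Java syntax, but the resulting tokens are [Word "Class"], [Whitespace " "], [Number 0.0], [Word "method"], [Symbol "("], and [Symbol ")"].
I'd be happy to turn off StreamTokenizer's number parsing entirely and parse the numbers myself from word tokens, but commenting out st.parseNumbers() seems to have no effect.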
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Token and its subclasses (IntegerToken, WordToken, ...) are defined elsewhere.
public class JavaTokenizer {
    private String code;
    private List<Token> tokens;

    public JavaTokenizer(String c) {
        code = c;
        tokens = new ArrayList<>();
    }

    public void tokenize() {
        try {
            // Create the tokenizer
            StringReader sr = new StringReader(code);
            StreamTokenizer st = new StreamTokenizer(sr);
            // Java-style tokenizing rules
            st.parseNumbers();
            st.wordChars('_', '_');
            st.eolIsSignificant(false);
            // Don't want whitespace tokens
            //st.ordinaryChars(0, ' ');
            // Strip out comments
            st.slashSlashComments(true);
            st.slashStarComments(true);
            // Parse the file
            int token;
            do {
                token = st.nextToken();
                switch (token) {
                case StreamTokenizer.TT_NUMBER:
                    // A number was found; the value is in nval
                    double num = st.nval;
                    if (num % 1 == 0)
                        tokens.add(new IntegerToken((int) num));
                    else
                        tokens.add(new FPNumberToken(num));
                    break;
                case StreamTokenizer.TT_WORD:
                    // A word was found; the value is in sval
                    String word = st.sval;
                    tokens.add(new WordToken(word));
                    break;
                case '"':
                    // A double-quoted string was found; sval contains the contents
                    String dquoteVal = st.sval;
                    tokens.add(new DoubleQuotedStringToken(dquoteVal));
                    break;
                case '\'':
                    // A single-quoted string was found; sval contains the contents
                    String squoteVal = st.sval;
                    tokens.add(new SingleQuotedStringToken(squoteVal));
                    break;
                case StreamTokenizer.TT_EOL:
                    // End of line character found
                    tokens.add(new EOLToken());
                    break;
                case StreamTokenizer.TT_EOF:
                    // End of file has been reached
                    tokens.add(new EOFToken());
                    break;
                default:
                    // A regular character was found; the value is the token itself
                    char ch = (char) st.ttype;
                    if (Character.isWhitespace(ch))
                        tokens.add(new WhitespaceToken(ch));
                    else
                        tokens.add(new SymbolToken(ch));
                    break;
                }
            } while (token != StreamTokenizer.TT_EOF);
            sr.close();
        } catch (IOException e) {
            // Not expected from a StringReader; surface it instead of swallowing it
            throw new UncheckedIOException(e);
        }
    }

    public List<Token> getTokens() {
        return tokens;
    }
}
I'll look into parboiled when I get a chance. In the meantime, the disgusting workaround I implemented to get it working is:
private static final String DANGLING_PERIOD_TOKEN = "___DANGLING_PERIOD_TOKEN___";
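Then, in tokenize():
// a period following whitespace and not followed by a digit is a "dangling period"
// (backslashes are doubled because the regex lives in a Java string literal)
code = code.replaceAll("(?<=\\s)\\.(?![0-9])", " "+DANGLING_PERIOD_TOKEN+" ");
And in the tokenizing loop: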
case StreamTokenizer.TT_WORD:
    // A word was found; the value is in sval
    String word = st.sval;
    if (word.equals(DANGLING_PERIOD_TOKEN))
        tokens.add(new SymbolToken('.'));
    else
        tokens.add(new WordToken(word));
    break;
This solution is specific to my needs: I don't care what the original whitespace was (since the replacement adds some around the inserted "token").
parseNumbers() is "on" by default. Use resetSyntax() to turn off number parsing and all other predefined character types, then enable what you need.
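A minimal sketch of that approach, under some assumptions of mine rather than anything from the question: the class name ResetSyntaxDemo and the helper tokenize are hypothetical, whitespace is treated as a separator instead of being emitted as tokens, and digits are registered as word characters so that numbers arrive as TT_WORD tokens to be parsed by hand later.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ResetSyntaxDemo {
    /** Returns the token texts of src, with StreamTokenizer's number parsing disabled. */
    static List<String> tokenize(String src) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(src));
        // Wipe the default table: every character becomes "ordinary",
        // which also removes the numeric attribute set by the constructor.
        st.resetSyntax();
        // Re-enable only what is needed (assumed Java-like rules).
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.wordChars('_', '_');
        st.wordChars('0', '9');      // digits are now word characters, not numbers
        st.whitespaceChars(0, ' ');  // assumption: whitespace only separates tokens
        st.quoteChar('"');
        st.quoteChar('\'');
        st.slashSlashComments(true);
        st.slashStarComments(true);

        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD || st.ttype == '"' || st.ttype == '\'') {
                    out.add(st.sval);                        // word or quoted-string contents
                } else {
                    out.add(String.valueOf((char) st.ttype)); // ordinary character such as '.'
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Class, ., method, (, 42, )]
        System.out.println(tokenize("Class .method(42)"));
    }
}
```

With this setup the loose period comes back as an ordinary '.' character, and "42" is left as a word for the caller to convert, for instance with Integer.parseInt.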
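That said, manual number parsing may get tricky once you have to account for dots and exponents... With a Scanner and regular expressions, it should be relatively straightforward to implement your own tokenizer, tailored exactly to your needs. For example, you may want to take a look at the Tokenizer inner class here: https://github.com/stefanhaustein/expressionparser/blob/master/core/src/main/java/org/kobjects/expressionparser/ExpressionParser.java (roughly the last 120 LOC)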
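To make the regex idea concrete, here is a rough sketch of a hand-rolled tokenizer. It is not taken from the linked project, and it uses java.util.regex.Matcher directly rather than java.util.Scanner. The class name RegexTokenizerSketch, the token kinds, and the "KIND:text" string representation are all hypothetical, and the grammar is a deliberately simplified subset of Java's (no hex literals, suffixes, strings, or comments).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizerSketch {
    // Group names, in the same order as the alternatives in TOKEN.
    private static final String[] KINDS = {"WS", "FLOAT", "INT", "WORD", "SYMBOL"};

    // Alternatives are tried left to right, so FLOAT must come before INT,
    // and a '.' only counts as part of a number when a digit is next to it.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<WS>\\s+)"
        + "|(?<FLOAT>\\d+\\.\\d*(?:[eE][+-]?\\d+)?"   // 1.  1.5  1.5e-3
        +          "|\\.\\d+(?:[eE][+-]?\\d+)?"        // .5  .5e3
        +          "|\\d+[eE][+-]?\\d+)"               // 1e10
        + "|(?<INT>\\d+)"
        + "|(?<WORD>[A-Za-z_$][A-Za-z0-9_$]*)"
        + "|(?<SYMBOL>.)",                             // any other single character
        Pattern.DOTALL);

    /** Returns tokens as "KIND:text" strings; whitespace is dropped. */
    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        // Every character matches some alternative, so find() walks the input without gaps.
        while (m.find()) {
            for (String kind : KINDS) {
                if (m.group(kind) != null) {
                    if (!kind.equals("WS")) {
                        out.add(kind + ":" + m.group());
                    }
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [WORD:x, SYMBOL:=, WORD:Class, SYMBOL:., WORD:method, SYMBOL:(, INT:42, SYMBOL:), SYMBOL:+, FLOAT:3.5e2]
        System.out.println(tokenize("x = Class .method(42) + 3.5e2"));
    }
}
```

Because integers and floating-point literals are separate alternatives, no num % 1 == 0 test is needed, and a period that is not adjacent to a digit falls through to SYMBOL.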