StreamTokenizer mangles integers and loose periods

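I've appropriated and modified the code below, which does a pretty good job of tokenizing Java code using Java's StreamTokenizer. Its number handling is problematic, though: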

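  1. It turns all integers into doubles. I can get around that by testing num % 1 == 0, but that feels like a hack.
  2. More critically, a . that follows whitespace is treated as a number. "Class .method()" is legal Java syntax, but the resulting tokens are [Word "Class"], [Whitespace " "], [Number 0.0], [Word "method"], [Symbol "("], and [Symbol ")"].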

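I'd be happy to turn off StreamTokenizer's number parsing entirely and parse the numbers myself from word tokens, but commenting out st.parseNumbers() seems to have no effect.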
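
For reference, a minimal reproduction of both symptoms might look like the sketch below. It assumes a StreamTokenizer left entirely in its default configuration, with parseNumbers() never called; DanglingPeriodRepro and its tokenize helper are hypothetical names used only for illustration, and whitespace is skipped rather than reported.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reproduction class; not part of the tokenizer shown in this post.
final class DanglingPeriodRepro {

    // Tokenizes with a default-configured StreamTokenizer.
    // Note that parseNumbers() is never called here.
    static List<String> tokenize(String code) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(code));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                case StreamTokenizer.TT_NUMBER:
                    out.add("Number " + st.nval);
                    break;
                case StreamTokenizer.TT_WORD:
                    out.add("Word " + st.sval);
                    break;
                default:
                    out.add("Symbol " + (char) st.ttype);
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // The lone period comes back as a number token with value 0.0
        System.out.println(tokenize("Class .method()"));
        // The integer comes back as a double
        System.out.println(tokenize("x 42"));
    }
}
```

Both inputs still produce number tokens even though the sketch never asks for number parsing. The full tokenizer in question follows.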

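import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Token and its subclasses (IntegerToken, WordToken, ...) are defined elsewhere
public class JavaTokenizer {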

private String code;

private List<Token> tokens;

public JavaTokenizer(String c) {
    code = c;
    tokens = new ArrayList<>();
}

public void tokenize() {
    try {
        // Create the tokenizer
        StringReader sr = new StringReader(code);
        StreamTokenizer st = new StreamTokenizer(sr);

        // Java-style tokenizing rules
        st.parseNumbers();
        st.wordChars('_', '_');
        st.eolIsSignificant(false);

        // Don't want whitespace tokens
        //st.ordinaryChars(0, ' ');

        // Strip out comments
        st.slashSlashComments(true);
        st.slashStarComments(true);

        // Parse the file
        int token;
        do {
            token = st.nextToken();
            switch (token) {
            case StreamTokenizer.TT_NUMBER:
                // A number was found; the value is in nval
                double num = st.nval;
                if(num % 1 == 0)
                  tokens.add(new IntegerToken((int)num));
                else
                  tokens.add(new FPNumberToken(num));
                break;
            case StreamTokenizer.TT_WORD:
                // A word was found; the value is in sval
                String word = st.sval;
                tokens.add(new WordToken(word));
                break;
            case '"':
                // A double-quoted string was found; sval contains the contents
                String dquoteVal = st.sval;
                tokens.add(new DoubleQuotedStringToken(dquoteVal));
                break;
            case '\'':
                // A single-quoted string was found; sval contains the contents
                String squoteVal = st.sval;
                tokens.add(new SingleQuotedStringToken(squoteVal));
                break;
            case StreamTokenizer.TT_EOL:
                // End of line character found
                tokens.add(new EOLToken());
                break;
            case StreamTokenizer.TT_EOF:
                // End of file has been reached
                tokens.add(new EOFToken());
                break;
            default:
                // A regular character was found; the value is the token itself
                char ch = (char) st.ttype;
                if(Character.isWhitespace(ch))
                    tokens.add(new WhitespaceToken(ch));
                else
                    tokens.add(new SymbolToken(ch));
                break;
            }
        } while (token != StreamTokenizer.TT_EOF);
        sr.close();
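    } catch (IOException e) {
        // Don't swallow the failure silently; surface it to the caller
        throw new java.io.UncheckedIOException(e);
    }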
}

public List<Token> getTokens() {
    return tokens;
}

}

I'll look into parboiled when I get a chance. In the meantime, the disgusting workaround I implemented to get it working is:

private static final String DANGLING_PERIOD_TOKEN = "___DANGLING_PERIOD_TOKEN___";

Then, in tokenize():

//a period following whitespace, not followed by a digit is a "dangling period"
code = code.replaceAll("(?<=\\s)\\.(?![0-9])", " "+DANGLING_PERIOD_TOKEN+" ");

And in the tokenization loop:

case StreamTokenizer.TT_WORD:
  // A word was found; the value is in sval
  String word = st.sval;
  if(word.equals(DANGLING_PERIOD_TOKEN))
    tokens.add(new SymbolToken('.'));
  else
    tokens.add(new WordToken(word));
  break;

This workaround is specific to my needs: I don't care what the original whitespace was (the replacement adds spaces around the inserted "token").

parseNumbers() is "on" by default. Use resetSyntax() to turn off number parsing and all other predefined character types, then enable what you need.
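
As a rough sketch of that approach (not a drop-in replacement for the code in the question), the configuration could look something like this. ResetSyntaxSketch, configure and tokenize are hypothetical names, the set of character classes re-enabled is an assumption about what a Java-like tokenizer needs, and whitespace is skipped here rather than emitted as tokens.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: resetSyntax() first, then re-enable only what is needed.
final class ResetSyntaxSketch {

    static StreamTokenizer configure(String code) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(code));
        st.resetSyntax();            // every character is now "ordinary"; number parsing is off
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.wordChars('0', '9');      // digits become plain word characters
        st.wordChars('_', '_');
        st.wordChars('$', '$');
        st.whitespaceChars(0, ' ');  // assumption: whitespace is skipped, not reported
        st.quoteChar('"');
        st.quoteChar('\'');
        st.slashSlashComments(true);
        st.slashStarComments(true);
        return st;
    }

    static List<String> tokenize(String code) {
        StreamTokenizer st = configure(code);
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("Word " + st.sval);
                } else if (st.ttype == '"' || st.ttype == '\'') {
                    out.add("String " + st.sval);
                } else {
                    out.add("Symbol " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // The lone period is now an ordinary symbol, not Number 0.0
        System.out.println(tokenize("Class .method()"));
        // Digits arrive as word tokens: 3, then '.', then 14
        System.out.println(tokenize("x = 3.14; // trailing comment"));
    }
}
```

With this setup a numeric literal such as 3.14 arrives as the word 3, the symbol ., and the word 14, so the caller has to reassemble numbers from those pieces.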

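That said, manual number parsing can get tricky once you account for dots and exponents... With a scanner and regular expressions, it should be relatively straightforward to implement your own tokenizer, tailored exactly to your needs. For an example, you may want to take a look at the Tokenizer inner class here: https://github.com/stefanhaustein/expressionparser/blob/master/core/src/main/java/org/kobjects/expressionparser/ExpressionParser.java (about 120 LOC at the end)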
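
To give an idea of the regular-expression route, here is a minimal sketch using java.util.regex directly. It is not the Tokenizer from the linked project; RegexTokenizerSketch, its pattern, and the token labels are all assumptions made for illustration, and string literals and comments are deliberately left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a hand-rolled, regex-based tokenizer.
final class RegexTokenizerSketch {

    // One named alternative per token kind. Order matters: numbers are tried
    // before single-character symbols, so ".5" is a number but a lone "." is not.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<ws>\\s+)"
            + "|(?<number>(?:\\d+\\.?\\d*|\\.\\d+)(?:[eE][+-]?\\d+)?)"
            + "|(?<word>[A-Za-z_$][A-Za-z0-9_$]*)"
            + "|(?<symbol>.)",
            Pattern.DOTALL);

    static List<String> tokenize(String code) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            if (m.group("ws") != null) {
                continue; // assumption: whitespace is not reported
            }
            String number = m.group("number");
            if (number != null) {
                // Digits only means integer; no "num % 1 == 0" test is needed
                boolean isInteger = number.matches("\\d+");
                out.add((isInteger ? "Integer " : "Float ") + number);
            } else if (m.group("word") != null) {
                out.add("Word " + m.group("word"));
            } else {
                out.add("Symbol " + m.group("symbol"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Class .method()"));
        System.out.println(tokenize("x = 42 + 3.5e2"));
    }
}
```

Because the number pattern requires at least one digit, a period followed by a letter falls through to the symbol alternative, and integers are distinguished from floating-point literals by their text rather than by their parsed value.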