java.io.StreamTokenizer produces a null token when it encounters an underscore
I have a StreamTokenizer that I use to parse tokens. When I pass the following to standard input:
a b_c d
the parsed tokens (on stdout) are:
a
b
null
c
d
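Why is this the case? If the underscore is a word character, there should be 3 tokens, the second being "b_c". If the underscore is a delimiter, there should be 4 tokens. I don't see the point of a null token.
Q1: Why is a null token produced?
Q2: Why would anyone design StreamTokenizer to produce a null token?
Ideone script: http://ideone.com/e.js/RFbPpJ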
import java.util.*;
import java.lang.*;
import java.io.*;

class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        StreamTokenizer st = new StreamTokenizer(br);
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            System.out.println(st.sval);
        }
    }
}
From the documentation of sval:
If the current token is a word token, this field contains a string
giving the characters of the word token. When the current token is a
quoted string token, this field contains the body of the string. The
current token is a word when the value of the ttype field is TT_WORD.
The current token is a quoted string token when the value of the ttype
field is a quote character.
The initial value of this field is null.
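That is, when neither condition is met, sval is null, and that is what gets printed.
In other words, the ttype of the underscore token is neither a word nor a quoted string.
The documentation for ttype states: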
After a call to the nextToken method, this field contains the type of
the token just read. For a single character token, its value is the
single character, converted to an integer. For a quoted string token,
its value is the quote character. Otherwise, its value is one of the
following: TT_WORD indicates that the token is a word. TT_NUMBER
indicates that the token is a number. TT_EOL indicates that the end of
line has been read. The field can only have this value if the
eolIsSignificant method has been called with the argument true. TT_EOF
indicates that the end of the input stream has been reached.
The initial value of this field is -4.
Note that the value -4 corresponds to TT_NOTHING, a private constant inside StreamTokenizer.
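To make the point about ttype concrete, here is a minimal sketch that checks ttype before touching sval or nval. The class name TokenTypeDemo and the helper describe are made up for this illustration, and the input is read from a StringReader instead of System.in so the example is self-contained.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original question.
class TokenTypeDemo {
    // Label each token according to ttype instead of printing sval blindly.
    static List<String> describe(String input) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            switch (st.ttype) {
                case StreamTokenizer.TT_WORD:
                    out.add("WORD(" + st.sval + ")");   // sval is only meaningful here...
                    break;
                case StreamTokenizer.TT_NUMBER:
                    out.add("NUMBER(" + st.nval + ")");
                    break;
                case '"':
                case '\'':
                    out.add("QUOTED(" + st.sval + ")"); // ...and for quoted strings
                    break;
                default:
                    // Single-character token: ttype is the character itself, sval is null.
                    out.add("CHAR(" + (char) st.ttype + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        for (String token : describe("a b_c d")) {
            System.out.println(token);
        }
    }
}
```

With the input from the question, the underscore shows up as CHAR(_) between WORD(b) and WORD(c), which is the token that the original loop printed as null.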
To have the underscore recognized as part of a word, you can call tokenizer.wordChars('_', '_');
wordChars is used to specify that all characters c in the range low <=
c <= high are word constituents. A word token consists of a word
constituent followed by zero or more word constituents or number
constituents.
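As a rough sketch of that call in context (the class name UnderscoreWordDemo and the helper words are illustrative only, and a StringReader stands in for standard input):

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original question.
class UnderscoreWordDemo {
    // Collect the word tokens after declaring '_' a word constituent.
    static List<String> words(String input) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.wordChars('_', '_'); // low == high: only the underscore is added
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                out.add(st.sval);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(words("a b_c d")); // prints [a, b_c, d]
    }
}
```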
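If you want the underscore to be an ordinary character rather than a word character, there is a method for that as well (ordinaryChar).
Note that passing '_' as both bounds of wordChars makes only the underscore a word character; if you pass a wider range, choose bounds that fit your needs.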
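The effect of the bounds can be seen in a small sketch like the following. The class BoundsDemo, its tokens helper, and the sample input "b_c[0]" are assumptions chosen for illustration; the wide range 'Z'..'a' is picked only because it happens to include '[', '\', ']', '^', '_' and '`'.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original answer.
class BoundsDemo {
    // Tokenize input after marking every character in [low, high] as a word constituent.
    static List<String> tokens(String input, int low, int high) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.wordChars(low, high);
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                out.add(st.sval);
            } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                out.add(String.valueOf(st.nval));
            } else {
                out.add(String.valueOf((char) st.ttype)); // single-character token
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Narrow bounds: only '_' joins the word characters.
        System.out.println(tokens("b_c[0]", '_', '_'));
        // Wide bounds: the brackets become word characters too.
        System.out.println(tokens("b_c[0]", 'Z', 'a'));
    }
}
```

With the narrow bounds the brackets remain separate single-character tokens; with the wide bounds the whole input collapses into one word, which is rarely what you want.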
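Edit: To answer your comment: in short, the underscore is not mapped to any character class by default, so it comes back as a single-character token whose ttype is '_'. Since that token is neither a word nor a quoted string, sval is null.
If you look at the undocumented private constructor of the StreamTokenizer class, you get a better idea of how each character is handled: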
private StreamTokenizer() {
    wordChars('a', 'z');
    wordChars('A', 'Z');
    wordChars(128 + 32, 255);
    whitespaceChars(0, ' ');
    commentChar('/');
    quoteChar('"');
    quoteChar('\'');
    parseNumbers();
}
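The underscore is ASCII code 95, which falls outside all of these word ranges ('A' to 'Z' is 65 to 90, 'a' to 'z' is 97 to 122).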
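The range check can be spelled out in a few lines. The helper isDefaultWordChar below is hypothetical: it simply mirrors the three wordChars calls from the constructor shown above, and tokenizerTreatsAsWord cross-checks it against a real StreamTokenizer with default settings.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

// Hypothetical demo class, not part of the original answer.
class DefaultWordChars {
    // Mirrors the default word ranges: 'a'..'z', 'A'..'Z', and 160..255.
    static boolean isDefaultWordChar(int c) {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= 128 + 32 && c <= 255);
    }

    // Asks a default-configured tokenizer whether a lone character is a word token.
    static boolean tokenizerTreatsAsWord(char c) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(String.valueOf(c)));
        return st.nextToken() == StreamTokenizer.TT_WORD;
    }

    public static void main(String[] args) throws IOException {
        System.out.println((int) '_');                  // prints 95
        System.out.println(isDefaultWordChar('_'));     // prints false
        System.out.println(tokenizerTreatsAsWord('_')); // prints false
        System.out.println(tokenizerTreatsAsWord('b')); // prints true
    }
}
```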