How is Guava Splitter.onPattern(..).split() different from String.split(..)?
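I recently harnessed the power of a look-ahead regular expression to split a String:
"abc8".split("(?=\\d)|\\W")
Printed to the console, this expression returns:
[abc, 8]
Pleased with this result, I wanted to move it over to Guava for further development, which looks like this:
Splitter.onPattern("(?=\\d)|\\W").split("abc8")
To my surprise, the output changed to:
[abc]
Why?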
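Guava's Splitter seems to have a bug when a pattern produces an empty match. If you create a Matcher and print out what it matches:
Pattern pattern = Pattern.compile("(?=\\d)|\\W");
Matcher matcher = pattern.matcher("abc8");
while (matcher.find()) {
    System.out.println(matcher.start() + "," + matcher.end());
}
you get the output 3,3, an empty match at index 3, immediately before the 8. Splitter splits there, and the result is only abc.
You can use, for example, Pattern#split(String), which seems to give the correct output:
Pattern.compile("(?=\\d)|\\W").split("abc8")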
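To make the comparison above easy to reproduce, here is a minimal sketch that uses only the JDK (no Guava). The class name RegexSplitDemo and the helper names matchSpans and jdkSplit are made up for illustration. It lists the match positions the Matcher reports, then shows what Pattern#split and String#split return for the same regex.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, JDK only; names are illustrative.
class RegexSplitDemo {
    // Lists every match of regex in input as "start,end".
    static List<String> matchSpans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            spans.add(matcher.start() + "," + matcher.end());
        }
        return spans;
    }

    // Splits with the JDK's own Pattern#split.
    static List<String> jdkSplit(String regex, String input) {
        return Arrays.asList(Pattern.compile(regex).split(input));
    }

    public static void main(String[] args) {
        String regex = "(?=\\d)|\\W";
        System.out.println(matchSpans(regex, "abc8"));            // [3,3]
        System.out.println(jdkSplit(regex, "abc8"));              // [abc, 8]
        System.out.println(Arrays.toString("abc8".split(regex))); // [abc, 8]
    }
}
```

The single span 3,3 is the zero-width look-ahead match just before the 8. The JDK splits at that position and keeps the remainder of the input as the last element.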
You found a bug!
Splitter s = Splitter.onPattern("(?=\\d)|\\W");
System.out.println(s.split("abc82")); // [abc, 8]
System.out.println(s.split("abc8")); // [abc]
This is the method that Splitter uses to actually split Strings (Splitter.SplittingIterator::computeNext):
@Override
protected String computeNext() {
    /*
     * The returned string will be from the end of the last match to the
     * beginning of the next one. nextStart is the start position of the
     * returned substring, while offset is the place to start looking for a
     * separator.
     */
    int nextStart = offset;
    while (offset != -1) {
        int start = nextStart;
        int end;
        int separatorPosition = separatorStart(offset);
        if (separatorPosition == -1) {
            end = toSplit.length();
            offset = -1;
        } else {
            end = separatorPosition;
            offset = separatorEnd(separatorPosition);
        }
        if (offset == nextStart) {
            /*
             * This occurs when some pattern has an empty match, even if it
             * doesn't match the empty string -- for example, if it requires
             * lookahead or the like. The offset must be increased to look for
             * separators beyond this point, without changing the start position
             * of the next returned substring -- so nextStart stays the same.
             */
            offset++;
            if (offset >= toSplit.length()) {
                offset = -1;
            }
            continue;
        }
        while (start < end && trimmer.matches(toSplit.charAt(start))) {
            start++;
        }
        while (end > start && trimmer.matches(toSplit.charAt(end - 1))) {
            end--;
        }
        if (omitEmptyStrings && start == end) {
            // Don't include the (unused) separator in next split string.
            nextStart = offset;
            continue;
        }
        if (limit == 1) {
            // The limit has been reached, return the rest of the string as the
            // final item. This is tested after empty string removal so that
            // empty strings do not count towards the limit.
            end = toSplit.length();
            offset = -1;
            // Since we may have changed the end, we need to trim it again.
            while (end > start && trimmer.matches(toSplit.charAt(end - 1))) {
                end--;
            }
        } else {
            limit--;
        }
        return toSplit.subSequence(start, end).toString();
    }
    return endOfData();
}
The area of interest is:
if (offset == nextStart) {
    /*
     * This occurs when some pattern has an empty match, even if it
     * doesn't match the empty string -- for example, if it requires
     * lookahead or the like. The offset must be increased to look for
     * separators beyond this point, without changing the start position
     * of the next returned substring -- so nextStart stays the same.
     */
    offset++;
    if (offset >= toSplit.length()) {
        offset = -1;
    }
    continue;
}
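This logic works well unless the empty match occurs at the end of the String, that is, immediately before its last character. If the empty match does occur there, offset++ makes offset equal to toSplit.length(), offset is set to -1, and the loop ends without returning that last character. This part should look like this (note the >= -> >):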
if (offset == nextStart) {
    /*
     * This occurs when some pattern has an empty match, even if it
     * doesn't match the empty string -- for example, if it requires
     * lookahead or the like. The offset must be increased to look for
     * separators beyond this point, without changing the start position
     * of the next returned substring -- so nextStart stays the same.
     */
    offset++;
    if (offset > toSplit.length()) {
        offset = -1;
    }
    continue;
}
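To see the effect of that one-character change without patching Guava, the loop can be modelled in plain Java. The following is a simplified sketch, not Guava's code. The class SplitLoopSketch and its fixed flag are hypothetical. Trimming, omitEmptyStrings and limit are left out. separatorStart and separatorEnd are assumed to behave like Matcher#find(int), Matcher#start() and Matcher#end().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified model of the splitting loop (no trimmer, no limit).
class SplitLoopSketch {
    // fixed == false uses the original ">=" check, fixed == true the proposed ">".
    static List<String> split(String regex, String toSplit, boolean fixed) {
        Matcher matcher = Pattern.compile(regex).matcher(toSplit);
        List<String> parts = new ArrayList<>();
        int offset = 0;
        while (offset != -1) {
            int nextStart = offset; // start of the next returned piece
            while (offset != -1) {
                int end;
                if (matcher.find(offset)) {   // assumed separatorStart(offset)
                    end = matcher.start();
                    offset = matcher.end();   // assumed separatorEnd(...)
                } else {
                    end = toSplit.length();
                    offset = -1;
                }
                if (offset == nextStart) {
                    // Empty match at the start of the piece: look further right.
                    offset++;
                    boolean giveUp = fixed
                            ? offset > toSplit.length()
                            : offset >= toSplit.length();
                    if (giveUp) {
                        offset = -1;
                    }
                    continue;
                }
                parts.add(toSplit.substring(nextStart, end));
                break;
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String regex = "(?=\\d)|\\W";
        System.out.println(split(regex, "abc8", false));  // [abc]
        System.out.println(split(regex, "abc8", true));   // [abc, 8]
        System.out.println(split(regex, "abc82", false)); // [abc, 8]
        System.out.println(split(regex, "abc82", true));  // [abc, 8, 2]
    }
}
```

With the > check, the loop gets one more search at offset == toSplit.length(). That search finds no separator, so the remaining text from nextStart to the end of the input is returned as the final piece.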