Scanner.findAll() and Matcher.results() work differently for the same input text and pattern

Question

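I ran into this interesting behavior while splitting a properties string with a regular expression, and I can't find the root cause.

I have a string containing text in the form of properties key=value pairs. I have a regex that splits the string into keys/values based on the position of =. It treats the first = as the split point. The value can itself contain =.

I tried two different approaches in Java.

Using the Scanner.findAll() method

This does not behave as expected. It should extract and print all keys based on the pattern, but I found that it behaves strangely. I have one key-value pair like this: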

    SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very important message This is very important .....}

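The key that should be extracted is SectionError.ErrorMessage=, but it also treats errorlevel= as a key.

Interestingly, if I remove a single character from the properties string that is passed in, it behaves correctly and extracts only the SectionError.ErrorMessage= key.

Using the Matcher.results() method

This works fine. No matter what we put in the properties string, there is no problem.

Sample code I tried: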

import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

import static java.util.regex.Pattern.MULTILINE;

public class MessageSplitTest {

    static final Pattern pattern = Pattern.compile("^[a-zA-Z0-9._]+=", MULTILINE);

    public static void main(String[] args) {
        final String properties =
                "SectionOne.KeyOne=first value\n" + // removing one char from here would make the scanner method print expected keys
                        "SectionOne.KeyTwo=second value\n" +
                        "SectionTwo.UUIDOne=379d827d-cf54-4a41-a3f7-1ca71568a0fa\n" +
                        "SectionTwo.UUIDTwo=384eef1f-b579-4913-a40c-2ba22c96edf0\n" +
                        "SectionTwo.UUIDThree=c10f1bb7-d984-422f-81ef-254023e32e5c\n" +
                        "SectionTwo.KeyFive=hello-world-sample\n" +
                        "SectionThree.KeyOne=first value\n" +
                        "SectionThree.KeyTwo=second value additional text just to increase the length of the text in this value still not enough adding more strings here n there\n" +
                        "SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very important message This is very important message This is very important messageThis is very important message This is very important message This is very important message This is very important message This is very important message This is very important message This is very important message This is very important messageThis is very important message This is very important message This is very important message This is very important message This is very important message}\n" +
                        "SectionFour.KeyOne=sixth value\n" +
                        "SectionLast.KeyOne=Country";

        printKeyValuesFromPropertiesUsingScanner(properties);
        System.out.println();
        printKeyValuesFromPropertiesUsingMatcher(properties);
    }

    private static void printKeyValuesFromPropertiesUsingScanner(String properties) {
        System.out.println("===Using Scanner===");
        try (Scanner scanner = new Scanner(properties)) {
            scanner
                    .findAll(pattern)
                    .map(MatchResult::group)
                    .forEach(System.out::println);
        }
    }

    private static void printKeyValuesFromPropertiesUsingMatcher(String properties) {
        System.out.println("===Using Matcher===");
        pattern.matcher(properties).results()
                .map(MatchResult::group)
                .forEach(System.out::println);

    }
}

Output:

===Using Scanner===
SectionOne.KeyOne=
SectionOne.KeyTwo=
SectionTwo.UUIDOne=
SectionTwo.UUIDTwo=
SectionTwo.UUIDThree=
SectionTwo.KeyFive=
SectionThree.KeyOne=
SectionThree.KeyTwo=
SectionError.ErrorMessage=
errorlevel=
SectionFour.KeyOne=
SectionLast.KeyOne=

===Using Matcher===
SectionOne.KeyOne=
SectionOne.KeyTwo=
SectionTwo.UUIDOne=
SectionTwo.UUIDTwo=
SectionTwo.UUIDThree=
SectionTwo.KeyFive=
SectionThree.KeyOne=
SectionThree.KeyTwo=
SectionError.ErrorMessage=
SectionFour.KeyOne=
SectionLast.KeyOne=

What could be the root cause of this? Does Scanner's findAll work differently from Matcher?

Please let me know if more information is needed.

Answer 1

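The documentation of Scanner mentions the word "buffer" several times. This suggests that a Scanner does not know the entire string it is reading and only holds a small part of it in a buffer at any one time. This makes sense, because Scanners are also designed to read from streams, and reading everything from a stream could take a long time (or forever!) and use a lot of memory.

In the source code of Scanner, there is indeed a CharBuffer: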

// Internal buffer used to hold input
private CharBuffer buf;

Because of the length and contents of your string, the Scanner has decided to load everything up to...

SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very...
                          ^
                    somewhere here
(It could be anywhere in the word "errorlevel")

...into the buffer. Then, once that first part of the string has been read, the remaining part starts like this:

errorlevel=Warning {HelpMessage:This is very...

errorlevel= is now at the start of the string, which causes the pattern to match.
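
The effect described above can be reproduced without a Scanner at all, by running the same pattern over the line as a whole and then over two chunks of it. This is only a sketch of the mechanism, not of Scanner's actual internals: the class and method names are made up for illustration, the message text is shortened, and the split position is simply assumed to fall right before "errorlevel".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BufferBoundaryDemo {
    // Same pattern as in the question: a key at the start of a line.
    static final Pattern KEY = Pattern.compile("^[a-zA-Z0-9._]+=", Pattern.MULTILINE);

    // Returns the first key found in the text, or null if there is none.
    static String firstKey(String text) {
        Matcher m = KEY.matcher(text);
        return m.find() ? m.group() : null;
    }

    // Counts all keys found in the text.
    static long countKeys(String text) {
        return KEY.matcher(text).results().count();
    }

    public static void main(String[] args) {
        // Shortened version of the problematic line from the question.
        String line = "SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very important message}";

        // Hypothetical chunk boundary, assumed to fall right before "errorlevel".
        int split = line.indexOf("errorlevel");
        String head = line.substring(0, split);
        String tail = line.substring(split);

        // Seen as a whole, the line contains exactly one key.
        System.out.println(countKeys(line));                   // prints 1
        // Seen as two independent chunks, the tail starts with a "key" of its own.
        System.out.println(countKeys(head) + countKeys(tail)); // prints 2
        System.out.println(firstKey(tail));                    // prints errorlevel=
    }
}
```

The `^` anchor matches at the very beginning of whatever input the regex engine is handed, so any chunk that happens to begin in the middle of a line can produce a false key.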

Related Bug?

Matcher does not use a buffer. It stores the entire string being matched in a field:

/**
 * The original string being matched.
 */
CharSequence text;

So this behavior is not observed with Matcher.

Answer 2

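That's right: this is an issue of the Scanner's buffer not containing the entire string. We can simplify the example to trigger the issue specifically: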

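import java.util.Scanner;
import java.util.StringJoiner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

static final Pattern pattern = Pattern.compile("^ABC.", Pattern.MULTILINE);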
public static void main(String[] args) {
    String testString = "\nABC1\nXYZ ABC2\nABC3ABC4\nABC4";
    String properties = "X".repeat(1024 - testString.indexOf("ABC4")) + testString;

    String s1 = usingScanner(properties);
    System.out.println("Using Scanner: "+s1);
    String m = usingMatcher(properties);
    System.out.println("Using Matcher: "+m);

    if(!s1.equals(m)) System.out.println("mismatch");
    if(s1.equals(usingScannerNoStream(properties)))
        System.out.println("Not a stream issue");
}
private static String usingScanner(String source) {
    return new Scanner(source)
        .findAll(pattern)
        .map(MatchResult::group)
        .collect(Collectors.joining(" + "));
}
private static String usingScannerNoStream(String source) {
    Scanner s = new Scanner(source);
    StringJoiner sj = new StringJoiner(" + ");
    for(;;) {
        String match = s.findWithinHorizon(pattern, 0);
        if(match == null) return sj.toString();
        sj.add(match);
    }
}
private static String usingMatcher(String source) {
    return pattern.matcher(source).results()
        .map(MatchResult::group)
        .collect(Collectors.joining(" + "));
}

This prints:

Using Scanner: ABC1 + ABC3 + ABC4 + ABC4
Using Matcher: ABC1 + ABC3 + ABC4
mismatch
Not a stream issue

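This example prepends a prefix of X characters so that the start of the false-positive match lines up with the size of the buffer. A Scanner's initial buffer size is 1024, though the buffer may be enlarged when needed.

Since findAll ignores the scanner's delimiters, just like findWithinHorizon, this code also shows that looping manually with findWithinHorizon exhibits the same behavior. In other words, this is not a problem of the Stream API used.

Since the Scanner will enlarge the buffer when needed, we can work around the issue by using a match operation that forces the entire content to be read into the buffer before performing the intended match operation, e.g.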
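
The padding arithmetic can be checked in isolation. The following sketch only verifies where the mid-line "ABC4" ends up in the padded string; the class name, the helper method, and the constant are illustrative, and the value 1024 is taken from the statement above rather than read from the Scanner.

```java
class PaddingDemo {
    // Initial Scanner buffer size as stated in the answer (an assumption here).
    static final int ASSUMED_BUFFER_SIZE = 1024;

    // Prepends enough 'X' characters so that the first occurrence of
    // marker starts exactly at index ASSUMED_BUFFER_SIZE.
    static String pad(String testString, String marker) {
        int offset = testString.indexOf(marker);
        return "X".repeat(ASSUMED_BUFFER_SIZE - offset) + testString;
    }

    public static void main(String[] args) {
        String testString = "\nABC1\nXYZ ABC2\nABC3ABC4\nABC4";
        String padded = pad(testString, "ABC4");

        // The first "ABC4" is the one in the middle of the line "ABC3ABC4".
        System.out.println(testString.indexOf("ABC4")); // prints 19
        System.out.println(padded.indexOf("ABC4"));     // prints 1024
        // The character right before it is '3', not a line terminator,
        // so "^ABC." should not match at this position.
        System.out.println(padded.charAt(1023));        // prints 3
    }
}
```

With 1005 padding characters, the first 1024 characters end right after "ABC3", so a buffer of that size would resume at the mid-line "ABC4".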

private static String usingScanner(String source) {
    Scanner s = new Scanner(source);
    s.useDelimiter("(?s).*").hasNext();
    return s
        .findAll(pattern)
        .map(MatchResult::group)
        .collect(Collectors.joining(" + "));
}

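This particular hasNext(), with a delimiter that consumes the entire string, forces the string to be buffered completely without advancing the position. The subsequent findAll() operation ignores both the delimiter and the result of the hasNext() check, but because the buffer is now completely filled, it no longer runs into the issue.

Of course, this destroys the advantage of the Scanner when parsing an actual stream.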
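
If the input really is a stream, one possible alternative (not part of the answer above, just a sketch under the assumption that every key sits at the start of its own line) is to read line by line and match each line separately. Each match then sees a complete line, so a chunk boundary cannot fake a line start. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LineByLineKeys {
    // No '^' and no MULTILINE needed: lookingAt() anchors at the start of each line.
    static final Pattern KEY = Pattern.compile("[a-zA-Z0-9._]+=");

    // Collects the key of every line that starts with one.
    // The caller remains responsible for closing the source.
    static List<String> keys(Reader source) {
        BufferedReader reader = new BufferedReader(source);
        return reader.lines()
                .map(KEY::matcher)
                .filter(Matcher::lookingAt)   // match must begin at the line start
                .map(m -> m.group())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String properties =
                "SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:text}\n" +
                "SectionFour.KeyOne=sixth value";
        // prints [SectionError.ErrorMessage=, SectionFour.KeyOne=]
        System.out.println(keys(new StringReader(properties)));
    }
}
```

Note that BufferedReader splits on \n, \r, and \r\n only, whereas `^` in MULTILINE mode also recognizes a few other line terminators, so results may differ for unusual input. A single very long line is still held in memory as a whole.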

Tags: java, regex, pattern-matching, java.util.scanner, java-9