Splitting a text into words using BufferedReader

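I'm working on a problem. I have to use a BufferedReader to add words to a TreeSet (and print the size of the set), but I can't get under the time limit of the automated speed test. The text contains only letters and spaces (and may include empty lines). I need to find a faster solution than this one: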

BufferedReader read = new BufferedReader(new InputStreamReader(System.in));
Set<String> text = new TreeSet<String>();
String words[], line;
while ((line = read.readLine()) != null) {
    words = line.split("\\s+");
    for (int i = 0; i < words.length && words[0].length() > 0; i++) {
        text.add(words[i]);
    }
}
System.out.println(text.size());

Is there any other "split" method I can use so that the compiler spends less "time thinking"?

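Given the assumptions you've stated, I would simply add everything to the set and remove the unwanted values at the end. This should slightly reduce the time spent checking the loop condition (though not by much).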

BufferedReader read = new BufferedReader(new InputStreamReader(System.in));
Set<String> text = new TreeSet<String>();
String words[], line;
while ((line = read.readLine()) != null) {
  words = line.split("\\s+");
  for(String value: words) {
    text.add(value);
  }
}
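// Only the empty string can slip in (from empty lines or leading whitespace).
// remove(null) is left out: a TreeSet with natural ordering throws NullPointerException on it.
text.remove("");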
System.out.println(text.size());
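
As a rough illustration of why a single cleanup step at the end is enough, here is a minimal sketch of which tokens `split("\\s+")` leaves behind. It assumes input shaped as the question describes (letters, spaces, possibly empty lines); the class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class WhitespaceSplitDemo {
    // Collects the distinct tokens of all lines, splitting on runs of whitespace.
    static Set<String> distinctTokens(String... lines) {
        Set<String> tokens = new TreeSet<>();
        for (String line : lines) {
            tokens.addAll(Arrays.asList(line.split("\\s+")));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // An empty line and a line with leading spaces both produce an empty token;
        // trailing spaces do not, because split drops trailing empty strings.
        Set<String> tokens = distinctTokens("", "  a b", "b  c  ");
        System.out.println(tokens.contains("")); // true
        tokens.remove("");
        System.out.println(tokens);              // [a, b, c]
    }
}
```

A TreeSet with natural ordering cannot hold null, and `split` with this pattern never produces a token that is a single space, so the empty string appears to be the only value that needs removing.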

In the line

words = line.split("\\s+");

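you split by a regular expression, which is much slower than splitting by a single character (about 5 times slower on my machine). See "Java split String performances".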

If words are separated by exactly one space, the fix is simple:

words = line.split(" ");

Just swap in this line and your code will run faster.

If words can be separated by several spaces, add this line after the loop:

text.remove("");

and still replace your regex split with the single-character split.
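
To make the effect of consecutive spaces concrete, here is a small hedged sketch (the class and method names are illustrative, not part of the original code). Splitting on a single space turns each extra space into an empty token, which one `remove("")` call then discards.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class SingleSpaceSplitDemo {
    // Counts distinct words in one line using the single-character split.
    static int countDistinctWords(String line) {
        // "1  2 1".split(" ") yields "1", "", "2", "1": the double space adds an empty token.
        Set<String> text = new TreeSet<>(Arrays.asList(line.split(" ")));
        text.remove(""); // drop the empty token, if any
        return text.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinctWords("1  2 1")); // 2
        System.out.println(countDistinctWords(""));       // 0
    }
}
```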

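import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Set;
import java.util.TreeSet;

public class Test {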
    public static void main(String[] args) throws IOException {
        // string contains 1, 2 and two spaces between 1 and 2. text size should be 2
        String txt = "1  2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n" +
            "1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n" +
            "1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n" +
            "1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n" +
            "1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n" +
            "1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1";

        InputStream inpstr = new ByteArrayInputStream(txt.getBytes());

        BufferedReader read = new BufferedReader(new InputStreamReader(inpstr));
        Set<String> text = new TreeSet<>();
        String[] words;
        String line;
        long startTime = System.nanoTime();
        while ((line = read.readLine()) != null) {
            //words = line.split("\\s+"); -- runs 5 times slower
            words = line.split(" ");
            for (int i = 0; i < words.length; i++) {
                text.add(words[i]);
            }
        }
        text.remove("");  // add only if words can be separated with multiple spaces

        long endTime = System.nanoTime();
        System.out.println((endTime - startTime) + " " + text.size());
    }
}

You can also replace the for loop with

text.addAll(Arrays.asList(words));

You can of course also stream the BufferedReader into a TreeSet:

Collection<String> c = read.lines()
        .flatMap(line -> Stream.of(line.split("\\s+")).filter(word -> word.length() > 0))
        .collect(Collectors.toCollection(TreeSet::new));
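
For anyone who wants to try the stream approach in isolation, here is a self-contained sketch. It reads from a `StringReader` instead of `System.in` so it can run without input; the class name, method name, and sample text are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamWordsDemo {
    // Streams the reader line by line and collects the distinct non-empty words.
    static Set<String> distinctWords(BufferedReader reader) {
        return reader.lines()
                .flatMap(line -> Stream.of(line.split("\\s+")))
                .filter(word -> !word.isEmpty()) // skip tokens from empty lines or leading spaces
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        // Sample input with a double space, an empty line, and a leading space.
        BufferedReader reader = new BufferedReader(new StringReader("1  2 1\n\n 3 2\n"));
        System.out.println(distinctWords(reader).size()); // 3
    }
}
```

This version still splits with a regular expression, so it is mainly a more compact way to write the loop rather than a speed-up.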