Printing the number of times a word appears in a txt file

Question

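I am trying to find the number of times the word "the" appears in a txt file. With the code below I keep getting 0 as my output when it should be 4520. I used a delimiter to separate out "the", but it doesn't seem to count it at all. The delimiter worked when I used "[^a-zA-Z]+" to count all words.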

in.useDelimiter("[^the]+");
while (in.hasNext()) {
    String words = in.next();
    words = words.toLowerCase();
    wordCount++;
}
System.out.println("The total number of 'the' is " + theWord);

Answer 1

In Java 9+, you can count the number of occurrences of a word in a text file like this:

static long countWord(String filename, String word) throws IOException {
    Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
    return Files.lines(Paths.get(filename)).flatMap(s -> p.matcher(s).results()).count();
}

Test

System.out.println(countWord("test.txt", "the"));

test.txt

The quick brown fox
jumps over the lazy dog

Output

2

Java 8 version:

static int countWord(String filename, String word) throws IOException {
    Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
    return Files.lines(Paths.get(filename)).mapToInt(s -> {
        int count = 0;
        for (Matcher m = p.matcher(s); m.find(); )
            count++;
        return count;
    }).sum();
}

Java 7 version:

static int countWord(String filename, String word) throws IOException {
    Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
    int count = 0;
    try (BufferedReader in = Files.newBufferedReader(Paths.get(filename), StandardCharsets.UTF_8)) {
        for (String line; (line = in.readLine()) != null; )
            for (Matcher m = p.matcher(line); m.find(); )
                count++;
    }
    return count;
}
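
To illustrate what the \b word boundaries and Pattern.quote contribute in the methods above, here is a minimal sketch that runs the same pattern against an in-memory string instead of a file. The class name WordBoundaryDemo, the helper countInText, and the sample sentences are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Hypothetical helper: counts whole-word, case-insensitive occurrences of word in text.
    static int countInText(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        int count = 0;
        for (Matcher m = p.matcher(text); m.find(); )
            count++;
        return count;
    }

    public static void main(String[] args) {
        // "other" and "them" contain "the" but are not whole-word matches.
        System.out.println(countInText("The other day he gave them the sword", "the")); // prints 2
        // Pattern.quote makes "." a literal dot, so "a.b" matches but "axb" does not.
        System.out.println(countInText("a.b axb a.b", "a.b")); // prints 2
    }
}
```

Note that in Java source the boundary is written "\\b"; a single "\b" inside a string literal is the backspace character, not a regex word boundary.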

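Update

Full code for a Java 7+ version that does not use a separate method and uses the much slower Scanner, since the OP seems to have trouble copy/pasting the methods above into their code.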

import java.io.File;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) throws Exception {
        int count = 0;
        try (Scanner in = new Scanner(new File("test.txt"))) {
            Pattern p = Pattern.compile("\\bthe\\b", Pattern.CASE_INSENSITIVE);
            while (in.hasNextLine())
                for (Matcher m = p.matcher(in.nextLine()); m.find(); )
                    count++;
        }
        System.out.println("The total number of 'the' is " + count);
    }
}

For comparison, the full version using the first method in this answer is:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) throws IOException {
        System.out.println("The total number of 'the' is " + countWord("test.txt", "the"));
    }
    static long countWord(String filename, String word) throws IOException {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        return Files.lines(Paths.get(filename)).flatMap(s -> p.matcher(s).results()).count();
    }
}
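
One optional refinement, not part of the original answer: the stream returned by Files.lines holds the file open until it is closed, so it is commonly wrapped in try-with-resources. The sketch below shows that variant under Java 9+. The class name CountWordClosing is hypothetical, and the sample file is written to a temporary path only so the example is self-contained.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class CountWordClosing {
    // Same counting logic as the Java 9+ method, but the line stream is closed explicitly.
    static long countWord(Path file, String word) throws IOException {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        try (Stream<String> lines = Files.lines(file)) {
            return lines.flatMap(s -> p.matcher(s).results()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary stand-in for test.txt with the same two lines.
        Path tmp = Files.createTempFile("countword", ".txt");
        try {
            Files.write(tmp, Arrays.asList("The quick brown fox", "jumps over the lazy dog"),
                    StandardCharsets.UTF_8);
            System.out.println("The total number of 'the' is " + countWord(tmp, "the"));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```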

Answer 2

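Use \b(?i)(the)\b as the regex (written with doubled backslashes, "\\b", inside a Java string literal), where \b stands for a word boundary, (?i) makes the match case-insensitive, and (the) matches "the" as a whole. Note that [] matches any single one of the characters it encloses, not the enclosed text as a whole.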

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        Scanner in = null;
        try {
            in = new Scanner(new File("file.txt"));
            int wordCount = 0, len;
            while (in.hasNextLine()) {
                // A limit of -1 keeps trailing empty strings, so n matches always yield n + 1 pieces
                len = in.nextLine().split("\\b(?i)(the)\\b", -1).length;
                wordCount += len - 1;
            }
            in.close();
            System.out.println("The total number of 'the' is " + wordCount);
        } catch (FileNotFoundException e) {
            System.out.println("File does not exist");
        }
    }
}

Output:

The total number of 'the' is 5

Content of file.txt:

The cat jumped over the rat.
The is written as THE in capital letter.
He gave them the sword.
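
To make the note about [] concrete, here is a small sketch of what a Scanner actually returns with the question's delimiter "[^the]+" compared with "[^a-zA-Z]+". It also shows one way to count "the" in the asker's Scanner style by comparing each token to the word. The class name DelimiterDemo and both helpers are hypothetical. The snippet in the question also prints theWord while incrementing wordCount, which may contribute to the output of 0, though the full program is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {
    // Hypothetical helper: collects the tokens a Scanner yields for the given delimiter regex.
    static List<String> tokens(String text, String delimiter) {
        List<String> result = new ArrayList<>();
        try (Scanner in = new Scanner(text)) {
            in.useDelimiter(delimiter);
            while (in.hasNext())
                result.add(in.next());
        }
        return result;
    }

    // Hypothetical helper: splits on non-letters, then counts tokens equal to word, ignoring case.
    static int countWord(String text, String word) {
        int count = 0;
        for (String token : tokens(text, "[^a-zA-Z]+"))
            if (token.equalsIgnoreCase(word))
                count++;
        return count;
    }

    public static void main(String[] args) {
        // [^the]+ treats every character other than 't', 'h', 'e' as a delimiter.
        System.out.println(tokens("the cat", "[^the]+"));    // prints [the, t]
        System.out.println(tokens("the cat", "[^a-zA-Z]+")); // prints [the, cat]
        System.out.println(countWord("The cat jumped over the rat.", "the")); // prints 2
    }
}
```

With "[^the]+" the tokens are runs of the letters t, h and e taken from any word, so counting tokens does not count occurrences of "the".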


java

java.util.scanner