Reading a txt file and recording its pages - Java

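I'm a novice software student and this is my first time here, so apologies if I've posted in the wrong place. I have an assignment that involves reading a text file with many lines (where every 40 lines make up one page), splitting it into words and, for each word, recording every occurrence and every page on which it occurs.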

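The catch is that I can only use linked lists (I can create my own methods) and arrays. I've spent a good while on this, but all I've managed is to store the split words, and even then I'm struggling with the logic for recording each word's page numbers and frequency... Where should I store the page numbers for each word? Should they go in an array? Or should I create a "Word" class to store the occurrences and page numbers? If so, should I also create a "Page" class to manage it? Honestly, I've tried both approaches, but neither seemed to work for me; the code ended up slow and messy, and I'm just confused lol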

Thanks in advance for the help!!!

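Edit: this is my take on the reading and splitting part of the problem. This time I decided to store all the words in one linked list. Here I changed it to 3 lines = 1 page. I've also left a few lines of the text at the end. The problem is that I don't know where or how to keep track of the page number for each group of 3 lines.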

public void loadBook(){  
    Path path1 = Paths.get("alice.txt");
    int countLines = 0;
    stopwords(); //another method to load the stopwords
    try (BufferedReader reader = Files.newBufferedReader(path1, Charset.defaultCharset())) {
        String line = null;
        while ((line = reader.readLine()) != null) {
            String[] split = line.split(" ");
            ++countLines;
            for (int i = 0; i < split.length; ++i){
                if (stopwords.notContains(split[i].toLowerCase())){
                    pages.add(split[i]);
                }
            }   
            if (countLines % 3 == 0) {
                countPages++;
                countLines = 0;
            }
        }
    } catch (IOException e) {
        System.err.format("Erro na leitura do arquivo: %s%n", e); // %s is needed for the exception to be printed
    }
}

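CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?'

So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!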

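First I thought of a class to encapsulate the data you need to extract from the text. You need the individual words, and for each word you need to count the number of times it occurs in the whole text and keep a list of the pages on which it appears. So I wrote the following WordRef class.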

import java.util.LinkedList;
import java.util.Objects;

public class WordRef {
    /** Number of times 'word' occurs in the text. */
    private int occurrences;

    /** The actual word. */
    private String word;

    /** List of page numbers where 'word' appears. */
    private LinkedList<Integer> pages;

    /**
     * Creates and returns instance of this class.
     * 
     * @param word - the actual word.
     */
    public WordRef(String word) {
        Objects.requireNonNull(word, "null word");
        this.word = word;
        occurrences = 1;
        pages = new LinkedList<Integer>();
    }

    /** Increment the number of occurrences of 'word'. */
    public void addOccurrence() {
        occurrences++;
    }

    /**
     * Add 'page' to the list of pages containing 'word'.
     * 
     * @param page - number of page to add.
     */
    public void addPage(Integer page) {
        if (!pages.contains(page)) {
            pages.add(page);
        }
    }

    /**
     * @return Number of occurrences of 'word'.
     */
    public int getOccurrences() {
        return occurrences;
    }

    /**
     * @return The actual 'word'.
     */
    public String getWord() {
        return word;
    }

    /**
     * Two 'WordRef' instances are equal if they both contain the exact, same word.
     */
    public boolean equals(Object obj) {
        boolean equal = false;
        if (obj != null) {
            Class<?> objClass = obj.getClass();
            if (objClass.equals(getClass())) {
                WordRef other = (WordRef) obj;
                String otherWord = other.getWord();
                equal = word.equals(otherWord);
            }
        }
        return equal;
    }

    /**
     * Equal 'WordRef' instances should each return the same hash code.
     */
    public int hashCode() {
        return word.hashCode();
    }

    /**
     * Returns a string representation of this instance.
     */
    public String toString() {
        return String.format("%s {%d} %s", word, occurrences, pages);
    }
}

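Note that the elements of a LinkedList must be objects, hence the use of Integer rather than int, since int is a primitive. Also note that we need to determine whether two WordRef instances contain the same word. That is why class WordRef contains method equals(), and according to the javadoc for class java.lang.Object, if a class overrides method equals() then it should also override method hashCode().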
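To illustrate why equals() matters here, the following is a minimal sketch, assuming a stripped-down stand-in for WordRef. The names EqualsDemo, Entry and find are hypothetical and exist only for this illustration. It shows that LinkedList.indexOf() locates an element by calling equals(), so a freshly created object can be used as a search key.

```java
import java.util.LinkedList;

// Hypothetical demo class, not part of the original answer.
class EqualsDemo {
    /** Minimal stand-in for WordRef: equality depends only on the word. */
    static final class Entry {
        final String word;

        Entry(String word) {
            this.word = word;
        }

        @Override
        public boolean equals(Object obj) {
            return obj instanceof Entry && word.equals(((Entry) obj).word);
        }

        @Override
        public int hashCode() {
            // Equal entries must return the same hash code.
            return word.hashCode();
        }
    }

    /** Returns the index of 'word' in 'entries', or -1 if it is absent. */
    static int find(LinkedList<Entry> entries, String word) {
        // indexOf() compares the argument with each element via equals().
        return entries.indexOf(new Entry(word));
    }

    public static void main(String[] args) {
        LinkedList<Entry> entries = new LinkedList<>();
        entries.add(new Entry("Alice"));
        entries.add(new Entry("Rabbit"));
        System.out.println(find(entries, "Rabbit")); // same word: found
        System.out.println(find(entries, "rabbit")); // different case: not found
    }
}
```

Without the equals() override, indexOf() would fall back to reference equality and a new search key would never match an element already in the list.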

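Now for the code that reads the text and processes it. In your question, you put all the code in a single method named loadBook(). However, in order to create a minimal, reproducible example, I wrote a separate class and put the text reading and processing code into method main() plus a few helper methods. Here is the code for that class.
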
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AliceTxt {
    private static final int PAGE = 3;
    private static final Pattern REGEX = Pattern.compile("\\b\\w+\\b"); // backslashes must be escaped in a Java string literal

    private static LinkedList<WordRef> wordRefs;

    private static List<String> getWords(String line) {
        if (line == null) {
            line = "";
        }
        Matcher matcher = REGEX.matcher(line);
        List<String> words = new ArrayList<>();
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    private static void updateWordRefs(List<String> words, int page) {
        if (words != null) {
            for (String word : words) {
                WordRef wordRef = new WordRef(word);
                int index = wordRefs.indexOf(wordRef);
                if (index < 0) {
                    wordRefs.add(wordRef);
                }
                else {
                    wordRef = wordRefs.get(index);
                    wordRef.addOccurrence();
                }
                wordRef.addPage(Integer.valueOf(page));
            }
        }
    }

    public static void main(String[] args) {
        Path path1 = Paths.get("alice.txt");
        try (BufferedReader reader = Files.newBufferedReader(path1, Charset.defaultCharset())) {
            wordRefs = new LinkedList<>();
            String line = reader.readLine();
            int countLines = 0;
            int page;
            while (line != null) {
                page = (countLines / PAGE) + 1;
                if (line.length() > 0) {
                    // Don't count empty lines.
                    countLines++;
                }
                List<String> words = getWords(line);
                updateWordRefs(words, page);
                line = reader.readLine();
            }
            wordRefs.forEach(System.out::println);
        }
        catch (IOException xIo) {
            xIo.printStackTrace();
        }
    }
}
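The page arithmetic used in main() can be checked in isolation. The following is a minimal sketch, assuming a hypothetical helper named pageOf in a hypothetical class PageMath (neither is part of the code above). It relies on integer division: the number of non-empty lines already read, divided by the lines per page, gives a zero-based page index.

```java
// Hypothetical helper class, not part of the original answer.
class PageMath {
    /**
     * Returns the 1-based page number of the next line, given how many
     * non-empty lines have already been read.
     */
    static int pageOf(int linesAlreadyRead, int linesPerPage) {
        // Integer division truncates, so 0..(linesPerPage - 1) all map to page 1.
        return (linesAlreadyRead / linesPerPage) + 1;
    }

    public static void main(String[] args) {
        int linesPerPage = 3; // same value as PAGE in the example above
        for (int read = 0; read < 7; read++) {
            System.out.printf("line %d -> page %d%n", read + 1, pageOf(read, linesPerPage));
        }
    }
}
```

The same formula should work for the 40-line pages in the assignment by passing 40 as linesPerPage, so no counter has to be reset at page boundaries.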

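The above class uses another LinkedList to hold all the distinct words in the text as separate WordRef objects. Note that in the above code words are case-sensitive, which means that So and so are treated as separate words. If you want words to be case-insensitive, i.e. So and so should be considered the same word, you can use the following flag from class java.util.regex.Pattern

private static final Pattern REGEX = Pattern.compile("\\b\\w+\\b", Pattern.CASE_INSENSITIVE);

Be aware that the flag only affects matching: \w already matches letters of either case and matcher.group() returns the text as written. Because WordRef.equals() compares words with String.equals(), you also need to normalize each word (for example with toLowerCase()) before creating its WordRef.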
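One way to make the counting itself case-insensitive is to normalize each word before it is wrapped in a WordRef, because matcher.group() returns the text exactly as written. The following is a minimal sketch, assuming a hypothetical class LowerCaseWords whose getWords method mirrors the helper of the same name above and additionally lower-cases every match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant of the getWords() helper, not part of the original answer.
class LowerCaseWords {
    private static final Pattern WORD = Pattern.compile("\\b\\w+\\b");

    /** Splits 'line' into words and converts each one to lower case. */
    static List<String> getWords(String line) {
        List<String> words = new ArrayList<>();
        if (line == null) {
            return words;
        }
        Matcher matcher = WORD.matcher(line);
        while (matcher.find()) {
            // Locale.ROOT keeps the conversion independent of the default locale.
            words.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(getWords("So VERY much, so VERY"));
    }
}
```

With this normalization, So and so produce equal strings, so the existing equals() in WordRef treats them as the same word. The trade-off is that the original capitalization is no longer available for printing.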

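Below is the output from running the above code, formatted according to my description of how you wanted the output to appear, which you confirmed was correct.
Each line below starts with the actual word, followed by the number of occurrences, followed by the list of page numbers on which the word appears in the text. Refer to method toString() in class WordRef.
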
Alice {3} [1, 3]
was {4} [1, 2, 3]
beginning {1} [1]
to {4} [1, 3]
get {1} [1]
very {2} [1, 2]
tired {1} [1]
of {6} [1, 2, 3]
sitting {1} [1]
by {2} [1, 2]
her {5} [1, 2]
sister {2} [1]
on {1} [1]
the {9} [1, 2, 3]
bank {1} [1]
and {4} [1, 2]
having {1} [1]
nothing {2} [1, 3]
do {1} [1]
once {1} [1]
or {3} [1]
twice {1} [1]
she {3} [1, 2]
had {2} [1]
peeped {1} [1]
into {1} [1]
book {2} [1]
reading {1} [1]
but {1} [1]
it {3} [1, 3]
no {1} [1]
pictures {2} [1]
conversations {2} [1]
in {3} [1, 2, 3]
what {1} [1]
is {1} [1]
use {1} [1]
a {3} [1, 2]
thought {1} [1]
without {1} [1]
So {1} [2]
considering {1} [2]
own {1} [2]
mind {1} [2]
as {2} [2]
well {1} [2]
could {1} [2]
for {1} [2]
hot {1} [2]
day {1} [2]
made {1} [2]
feel {1} [2]
sleepy {1} [2]
stupid {1} [2]
whether {1} [2]
pleasure {1} [2]
making {1} [2]
daisy {1} [2]
chain {1} [2]
would {1} [2]
be {1} [2]
worth {1} [2]
trouble {1} [2]
getting {1} [2]
up {1} [2]
picking {1} [2]
daisies {1} [2]
when {1} [2]
suddenly {1} [2]
White {1} [2]
Rabbit {2} [2, 3]
with {1} [2]
pink {1} [2]
eyes {1} [2]
ran {1} [2]
close {1} [2]
There {1} [3]
so {2} [3]
VERY {2} [3]
remarkable {1} [3]
that {1} [3]
nor {1} [3]
did {1} [3]
think {1} [3]
much {1} [3]
out {1} [3]
way {1} [3]
hear {1} [3]
say {1} [3]
itself {1} [3]
Oh {1} [3]
dear {1} [3]