Dictionary based search optimization in Java
I have a Sentences class. Each instance of this class represents one sentence from a text file.
I read every sentence from the file and turn it into an instance of my Sentences class. For each sentence, I need to check how many stop words / function words it contains.
I have a text file (stopwords.txt) containing the English stop words.
How should I design my program so that I do not have to read the stopwords.txt file over and over again for every sentence? Instead, I want to keep the contents of this file (the stop words) in memory somehow and then check which words of my sentence are stop words.
I have a lot of sentences, and I need this program to be as fast as possible.
Edit:
I created a stop-word class:
public class StopWords
I am reading the stopwords.txt file in this class and saving the words in a HashSet.
....
while ((entries = br.readLine()) != null){
    stopWordSet.add(entries.toLowerCase());
...
Then I created an instance of the StopWords class inside my Sentences class:
public class Sentences {
    ...
    private static StopWords stopList = new StopWords("languageresources/stopword.txt");
    ...
}
I read the sentences from a file and create an instance of the Sentences class for each one. Every word of a sentence is stored in an ArrayList called wordList, which is passed to the dealStopWord() method of the StopWords class to check which words are stop words. Finally, I use the getStopWordCount() method to get the number of stop words:
stopList.dealStopWord(wordList);
this.totalFunctionWords = stopList.getStopWordCount();
Edit: If I make the stopList variable local to the Sentences class, the constructor is called for every sentence (i.e., stopwords.txt is read once per sentence), yet it is much faster than when the stopList variable is static (i.e., stopwords.txt is read only once).
Edit:
The StopWords.java class:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;

public class StopWords {
    // Instance variables
    private String stopWordFile = "";   // name of the stopword file
    private Set<String> stopWordSet;
    private int count = 0;              // number of stopwords found in a given sentence
    private String[] sortedStopWords;
    private ArrayList<String> noStopWordArray = new ArrayList<String>();

    // Constructor: takes the file containing stopwords
    public StopWords(String fileName) {
        System.out.println("Stoplist constructor called");
        this.stopWordFile = fileName;
        FileReader stopWordFile = null;
        try {
            stopWordFile = new FileReader(this.stopWordFile);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        BufferedReader br = new BufferedReader(stopWordFile);
        String entries;
        stopWordSet = new TreeSet<String>();
        try {
            while ((entries = br.readLine()) != null) {
                stopWordSet.add(entries.toLowerCase());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        try {
            br.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        sortedStopWords = new String[stopWordSet.size()];
        int i = 0;
        Iterator<String> itr = stopWordSet.iterator();
        while (itr.hasNext()) {
            sortedStopWords[i++] = itr.next();
        } // end while
    } // public StopWords(String fileName)

    // Count the stopwords in a sentence (the sentence comes in as an ArrayList of words);
    // the result is read back via getStopWordCount()
    public void dealStopWord(ArrayList<String> wordArray) {
        this.count = 0;
        String temp = "";
        int size = wordArray.size();
        for (int i = 0; i < size; i++) {
            temp = wordArray.get(i).toLowerCase();
            int found = Arrays.binarySearch(sortedStopWords, temp);
            if (found >= 0) {
                this.count++;
            } // end if
            else {
                this.noStopWordArray.add(wordArray.get(i));
            }
        } // end for
    }

    public ArrayList<String> getNoStopWordArray() {
        return this.noStopWordArray;
    } // public ArrayList<String> getNoStopWordArray()

    public int getStopWordCount() {
        return this.count;
    } // public int getStopWordCount()
} // public class StopWords
Part of the Sentences.java class:
public class Sentences {
    static StopWords stopList = new StopWords("languageresources/stopword.txt");

    public void setFunctionAndContentWords() {
        // If I make the stopList variable local here, the code is much faster
        stopList.dealStopWord(this.wordList); // at this point, the # of stop words and the sentence without stop words is generated
        this.totalFunctionWords = stopList.getStopWordCount(); // setting the feature here
        // ...set up done.
    } // end method
}
This is how I create the instances of the Sentences class:
Sentences[] s = new Sentences[totalSentences]; // sentence objects
// (construction of each Sentences object from the input file is assumed to happen before this loop; it is omitted here)
for (int i = 0; i < totalSentences; i++) {
    System.out.println("Processing sentence # " + (i + 1));
    s[i].setFunctionAndContentWords();
}
Maybe you can use a hash set. Before you start reading the sentences, put all the stop words into a hash set. Then, for each word, check whether it is a stop word with:
stopWordsHashSet.contains(word);
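A minimal sketch of that approach, using a hypothetical helper class (the names StopWordLookup, loadStopWords and countStopWords are illustrative, not from the original code): the file is read once into a HashSet, and afterwards each sentence only pays for contains() checks:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordLookup {

    // Read the stop-word file once and keep the words in a HashSet for constant-time lookups.
    public static Set<String> loadStopWords(String fileName) throws IOException {
        Set<String> stopWords = new HashSet<String>();
        BufferedReader br = new BufferedReader(new FileReader(fileName));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                stopWords.add(line.trim().toLowerCase());
            }
        } finally {
            br.close();
        }
        return stopWords;
    }

    // Count how many words of one sentence are stop words.
    public static int countStopWords(List<String> sentenceWords, Set<String> stopWords) {
        int count = 0;
        for (String word : sentenceWords) {
            if (stopWords.contains(word.toLowerCase())) {
                count++;
            }
        }
        return count;
    }
}

Building the set once before the sentence loop means stopwords.txt is never re-read, and a hash lookup per word is usually at least as fast as a binary search over a sorted array.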
Make sure your StopWords instance does not accumulate information, or that it gets reset. I would make it completely stateless (no counters, and especially no list of non-matching words).
This also has the benefit that you can use it from multiple threads.
In your case:
this.noStopWordArray.add(wordArray.get(i));
causes the array to keep growing (which is a bigger problem in the static case, because you reuse the array across many sentences).
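A sketch of what a stateless variant could look like (this assumes returning the count directly instead of storing it in a field is acceptable; the class and method names are illustrative): the instance only holds the immutable sorted stop-word array, so one shared instance is safe across all sentences and across threads:

import java.util.Arrays;
import java.util.List;

public class StatelessStopWords {

    // Must be sorted, e.g. built from a TreeSet as in the original constructor.
    private final String[] sortedStopWords;

    public StatelessStopWords(String[] sortedStopWords) {
        this.sortedStopWords = sortedStopWords;
    }

    // No fields are written here: the count is returned instead of stored,
    // and non-matching words are not collected in a shared list.
    public int countStopWords(List<String> wordArray) {
        int count = 0;
        for (String word : wordArray) {
            if (Arrays.binarySearch(sortedStopWords, word.toLowerCase()) >= 0) {
                count++;
            }
        }
        return count;
    }
}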