Dictionary based search optimization in Java
I have a Sentences class. Each instance of this class represents one sentence from a text file.
I read every sentence from the file and turn it into an instance of my Sentences class. For each sentence, I need to check how many stop words / function words it contains.
I have a text file (stopwords.txt) containing the English stop words.
How should I design my program so that I do not have to read the stopwords.txt file over and over again for every sentence? Instead, I want to keep the contents of this file (the stop words) in memory somehow and then check which words of my sentence are stop words.
I have a lot of sentences, and I need this program to be as fast as possible.
Edit:
I created a stop-word class:
public class StopWords
I am reading the stopwords.txt file in this class and saving the words in a HashSet.
....
while ((entries = br.readLine()) != null){
    stopWordSet.add(entries.toLowerCase());
...
Then I created an instance of the StopWords class inside my Sentences class:
public class Sentences {
    ...
    private static StopWords stopList = new StopWords("languageresources/stopword.txt");
    ...
}
I read the sentences from a file and create an instance of the Sentences class for each one. Every word of a sentence is stored in an ArrayList called wordList, which is passed to the dealStopWord() method of the StopWords class to check which words are stop words. Finally, I use the getStopWordCount() method to get the number of stop words:
stopList.dealStopWord(wordList);
this.totalFunctionWords = stopList.getStopWordCount();
Edit: If I make the stopList variable local to the Sentences class, the constructor is called for every sentence (i.e., stopwords.txt is read once per sentence), yet it is much faster than when the stopList variable is static (i.e., stopwords.txt is read only once).
Edit:
The StopWords.java class:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;

public class StopWords {
    // Instance variables
    private String stopWordFile = "";   // name of the stopword file
    private Set<String> stopWordSet;
    private int count = 0;              // number of stopwords found in a given sentence
    private String[] sortedStopWords;
    private ArrayList<String> noStopWordArray = new ArrayList<String>();

    // Constructor: takes the file containing stopwords
    public StopWords(String fileName) {
        System.out.println("Stoplist constructor called");
        this.stopWordFile = fileName;
        FileReader stopWordFile = null;
        try {
            stopWordFile = new FileReader(this.stopWordFile);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        BufferedReader br = new BufferedReader(stopWordFile);
        String entries;
        stopWordSet = new TreeSet<String>();
        try {
            while ((entries = br.readLine()) != null) {
                stopWordSet.add(entries.toLowerCase());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        try {
            br.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        sortedStopWords = new String[stopWordSet.size()];
        int i = 0;
        Iterator<String> itr = stopWordSet.iterator();
        while (itr.hasNext()) {
            sortedStopWords[i++] = itr.next();
        } // end while
    } // public StopWords(String fileName)

    // Count the stopwords in a sentence (the sentence comes in as an ArrayList of words);
    // the result is read back via getStopWordCount()
    public void dealStopWord(ArrayList<String> wordArray) {
        this.count = 0;
        String temp = "";
        int size = wordArray.size();
        for (int i = 0; i < size; i++) {
            temp = wordArray.get(i).toLowerCase();
            int found = Arrays.binarySearch(sortedStopWords, temp);
            if (found >= 0) {
                this.count++;
            } // end if
            else {
                this.noStopWordArray.add(wordArray.get(i));
            }
        } // end for
    }

    public ArrayList<String> getNoStopWordArray() {
        return this.noStopWordArray;
    } // public ArrayList<String> getNoStopWordArray()

    public int getStopWordCount() {
        return this.count;
    } // public int getStopWordCount()
} // public class StopWords
Part of the Sentences.java class:
public class Sentences {
    static StopWords stopList = new StopWords("languageresources/stopword.txt");

    public void setFunctionAndContentWords() {
        // If I make the stopList variable local here, the code is much faster
        stopList.dealStopWord(this.wordList); // at this point, the # of stop words and the sentence without stop words is generated
        this.totalFunctionWords = stopList.getStopWordCount(); // setting the feature here
        // ...set up done.
    } // end method
}
This is how I create the instances of the Sentences class:
Sentences[] s = new Sentences[totalSentences]; // sentence objects
// (construction of each Sentences object from the input file is assumed to happen before this loop; it is omitted here)
for (int i = 0; i < totalSentences; i++) {
    System.out.println("Processing sentence # " + (i + 1));
    s[i].setFunctionAndContentWords();
}
Maybe you can use a hash set. Before you start reading the sentences, put all the stop words into a hash set. Then, for each word, check whether it is a stop word with:
stopWordsHashSet.contains(word);
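A minimal sketch of that approach, using a hypothetical helper class (the names StopWordLookup, loadStopWords and countStopWords are illustrative, not from the original code): the file is read once into a HashSet, and afterwards each sentence only pays for contains() checks:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordLookup {

    // Read the stop-word file once and keep the words in a HashSet for constant-time lookups.
    public static Set<String> loadStopWords(String fileName) throws IOException {
        Set<String> stopWords = new HashSet<String>();
        BufferedReader br = new BufferedReader(new FileReader(fileName));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                stopWords.add(line.trim().toLowerCase());
            }
        } finally {
            br.close();
        }
        return stopWords;
    }

    // Count how many words of one sentence are stop words.
    public static int countStopWords(List<String> sentenceWords, Set<String> stopWords) {
        int count = 0;
        for (String word : sentenceWords) {
            if (stopWords.contains(word.toLowerCase())) {
                count++;
            }
        }
        return count;
    }
}

Building the set once before the sentence loop means stopwords.txt is never re-read, and a hash lookup per word is usually at least as fast as a binary search over a sorted array.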
Make sure your StopWords instance does not accumulate information, or that it gets reset. I would make it completely stateless (no counters, and especially no list of non-matching words).
This also has the benefit that you can use it from multiple threads.
In your case:
this.noStopWordArray.add(wordArray.get(i));
causes the array to keep growing (which is a bigger problem in the static case, because you reuse the array across many sentences).
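A sketch of what a stateless variant could look like (this assumes returning the count directly instead of storing it in a field is acceptable; the class and method names are illustrative): the instance only holds the immutable sorted stop-word array, so one shared instance is safe across all sentences and across threads:

import java.util.Arrays;
import java.util.List;

public class StatelessStopWords {

    // Must be sorted, e.g. built from a TreeSet as in the original constructor.
    private final String[] sortedStopWords;

    public StatelessStopWords(String[] sortedStopWords) {
        this.sortedStopWords = sortedStopWords;
    }

    // No fields are written here: the count is returned instead of stored,
    // and non-matching words are not collected in a shared list.
    public int countStopWords(List<String> wordArray) {
        int count = 0;
        for (String word : wordArray) {
            if (Arrays.binarySearch(sortedStopWords, word.toLowerCase()) >= 0) {
                count++;
            }
        }
        return count;
    }
}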