如何阻止 java 拼写检查程序纠正重复的单词
how to stop a java spell checker program from correcting repetitive words
我已经实现了一个执行以下操作的程序:
- 将网页中的所有单词扫描成一个字符串(使用 jsoup)
- 过滤掉所有 HTML 标记和代码
- 将这些词放入拼写检查程序并提供建议
拼写检查程序将 dictionary.txt 文件加载到一个数组中,并将输入的字符串与字典中的单词进行比较。
我现在的问题是,当输入多次包含同一个词时,比如"teh program is teh worst",代码会打印出来
You entered 'teh', did you mean 'the'?
You entered 'teh', did you mean 'the'?
有时一个网站会一遍又一遍地包含多个单词,这会变得很乱。
如果可能的话,打印单词及其拼写错误的次数是完美的,但限制每个单词打印一次就足够了。
我的程序有几个方法和两个classes,但拼写检查方法如下:
注意:原始代码包含一些删除标点符号的 'if' 语句,但为了清楚起见,我已删除它们。
static boolean suggestWord;
public static String checkWord(String wordToCheck) {
String wordCheck;
String word = wordToCheck.toLowerCase();
if ((wordCheck = (String) dictionary.get(word)) != null) {
suggestWord = false; // no need to ask for suggestion for a correct
// word.
return wordCheck;
}
// If after all of these checks a word could not be corrected, return as
// a misspelled word.
return word;
}
临时编辑:根据要求,完整代码:
Class 1:
public class ParseCleanCheck {
static Hashtable<String, String> dictionary;// To store all the words of the
// dictionary
static boolean suggestWord;// To indicate whether the word is spelled
// correctly or not.
static Scanner urlInput = new Scanner(System.in);
public static String cleanString;
public static String url = "";
public static boolean correct = true;
/**
* PARSER METHOD
*/
public static void PageScanner() throws IOException {
System.out.println("Pick an english website to scan.");
// This do-while loop allows the user to try again after a mistake
do {
try {
System.out.println("Enter a URL, starting with http://");
url = urlInput.nextLine();
// This creates a document out of the HTML on the web page
Document doc = Jsoup.connect(url).get();
// This converts the document into a string to be cleaned
String htmlToClean = doc.toString();
cleanString = Jsoup.clean(htmlToClean, Whitelist.none());
correct = false;
} catch (Exception e) {
System.out.println("Incorrect format for a URL. Please try again.");
}
} while (correct);
}
/**
* SPELL CHECKER METHOD
*/
public static void SpellChecker() throws IOException {
dictionary = new Hashtable<String, String>();
System.out.println("Searching for spelling errors ... ");
try {
// Read and store the words of the dictionary
BufferedReader dictReader = new BufferedReader(new FileReader("dictionary.txt"));
while (dictReader.ready()) {
String dictInput = dictReader.readLine();
String[] dict = dictInput.split("\s"); // create an array of
// dictionary words
for (int i = 0; i < dict.length; i++) {
// key and value are identical
dictionary.put(dict[i], dict[i]);
}
}
dictReader.close();
String user_text = "";
// Initializing a spelling suggestion object based on probability
SuggestSpelling suggest = new SuggestSpelling("wordprobabilityDatabase.txt");
// get user input for correction
{
user_text = cleanString;
String[] words = user_text.split(" ");
int error = 0;
for (String word : words) {
if(!dictionary.contains(word)) {
checkWord(word);
dictionary.put(word, word);
}
suggestWord = true;
String outputWord = checkWord(word);
if (suggestWord) {
System.out.println("Suggestions for " + word + " are: " + suggest.correct(outputWord) + "\n");
error++;
}
}
if (error == 0) {
System.out.println("No mistakes found");
}
}
} catch (IOException e) {
e.printStackTrace();
System.exit(-1);
}
}
/**
* METHOD TO SPELL CHECK THE WORDS IN A STRING. IS USED IN SPELL CHECKER
* METHOD THROUGH THE "WORD" STRING
*/
public static String checkWord(String wordToCheck) {
String wordCheck;
String word = wordToCheck.toLowerCase();
if ((wordCheck = (String) dictionary.get(word)) != null) {
suggestWord = false; // no need to ask for suggestion for a correct
// word.
return wordCheck;
}
// If after all of these checks a word could not be corrected, return as
// a misspelled word.
return word;
}
}
还有第二个 class (SuggestSpelling.java),它包含一个概率计算器,但现在不相关,除非您计划 运行 自己编写代码。
使用 HashSet
检测重复项 -
Set<String> wordSet = new HashSet<>();
并存储输入句子的每个单词。如果在插入 HashSet
期间已经存在任何单词,则不要为该单词调用 checkWord(String wordToCheck)
。像这样的 -
String[] words = // split input sentence into words
for(String word: words) {
if(!wordSet.contains(word)) {
checkWord(word);
// do stuff
wordSet.add(word);
}
}
编辑
// ....
{
user_text = cleanString;
String[] words = user_text.split(" ");
Set<String> wordSet = new HashSet<>();
int error = 0;
for (String word : words) {
// wordSet is another data-structure. Its only for duplicates checking, don't mix it with dictionary
if(!wordSet.contains(word)) {
// put all your logic here
wordSet.add(word);
}
}
if (error == 0) {
System.out.println("No mistakes found");
}
}
// ....
你还有其他错误,就像你将 String wordCheck
作为 checkWord
的参数传递并再次在 checkWord()
中重新声明它 String wordCheck;
这是不正确的。请检查其他部分。
我已经实现了一个执行以下操作的程序:
- 将网页中的所有单词扫描成一个字符串(使用 jsoup)
- 过滤掉所有 HTML 标记和代码
- 将这些词放入拼写检查程序并提供建议
拼写检查程序将 dictionary.txt 文件加载到一个数组中,并将输入的字符串与字典中的单词进行比较。
我现在的问题是,当输入多次包含同一个词时,比如"teh program is teh worst",代码会打印出来
You entered 'teh', did you mean 'the'?
You entered 'teh', did you mean 'the'?
有时一个网站会一遍又一遍地包含多个单词,这会变得很乱。
如果可能的话,打印单词及其拼写错误的次数是完美的,但限制每个单词打印一次就足够了。
我的程序有几个方法和两个classes,但拼写检查方法如下:
注意:原始代码包含一些删除标点符号的 'if' 语句,但为了清楚起见,我已删除它们。
static boolean suggestWord;
public static String checkWord(String wordToCheck) {
String wordCheck;
String word = wordToCheck.toLowerCase();
if ((wordCheck = (String) dictionary.get(word)) != null) {
suggestWord = false; // no need to ask for suggestion for a correct
// word.
return wordCheck;
}
// If after all of these checks a word could not be corrected, return as
// a misspelled word.
return word;
}
临时编辑:根据要求,完整代码:
Class 1:
public class ParseCleanCheck {
static Hashtable<String, String> dictionary;// To store all the words of the
// dictionary
static boolean suggestWord;// To indicate whether the word is spelled
// correctly or not.
static Scanner urlInput = new Scanner(System.in);
public static String cleanString;
public static String url = "";
public static boolean correct = true;
/**
* PARSER METHOD
*/
public static void PageScanner() throws IOException {
System.out.println("Pick an english website to scan.");
// This do-while loop allows the user to try again after a mistake
do {
try {
System.out.println("Enter a URL, starting with http://");
url = urlInput.nextLine();
// This creates a document out of the HTML on the web page
Document doc = Jsoup.connect(url).get();
// This converts the document into a string to be cleaned
String htmlToClean = doc.toString();
cleanString = Jsoup.clean(htmlToClean, Whitelist.none());
correct = false;
} catch (Exception e) {
System.out.println("Incorrect format for a URL. Please try again.");
}
} while (correct);
}
/**
* SPELL CHECKER METHOD
*/
public static void SpellChecker() throws IOException {
dictionary = new Hashtable<String, String>();
System.out.println("Searching for spelling errors ... ");
try {
// Read and store the words of the dictionary
BufferedReader dictReader = new BufferedReader(new FileReader("dictionary.txt"));
while (dictReader.ready()) {
String dictInput = dictReader.readLine();
String[] dict = dictInput.split("\s"); // create an array of
// dictionary words
for (int i = 0; i < dict.length; i++) {
// key and value are identical
dictionary.put(dict[i], dict[i]);
}
}
dictReader.close();
String user_text = "";
// Initializing a spelling suggestion object based on probability
SuggestSpelling suggest = new SuggestSpelling("wordprobabilityDatabase.txt");
// get user input for correction
{
user_text = cleanString;
String[] words = user_text.split(" ");
int error = 0;
for (String word : words) {
if(!dictionary.contains(word)) {
checkWord(word);
dictionary.put(word, word);
}
suggestWord = true;
String outputWord = checkWord(word);
if (suggestWord) {
System.out.println("Suggestions for " + word + " are: " + suggest.correct(outputWord) + "\n");
error++;
}
}
if (error == 0) {
System.out.println("No mistakes found");
}
}
} catch (IOException e) {
e.printStackTrace();
System.exit(-1);
}
}
/**
* METHOD TO SPELL CHECK THE WORDS IN A STRING. IS USED IN SPELL CHECKER
* METHOD THROUGH THE "WORD" STRING
*/
public static String checkWord(String wordToCheck) {
String wordCheck;
String word = wordToCheck.toLowerCase();
if ((wordCheck = (String) dictionary.get(word)) != null) {
suggestWord = false; // no need to ask for suggestion for a correct
// word.
return wordCheck;
}
// If after all of these checks a word could not be corrected, return as
// a misspelled word.
return word;
}
}
还有第二个 class (SuggestSpelling.java),它包含一个概率计算器,但现在不相关,除非您计划 运行 自己编写代码。
使用 HashSet
检测重复项 -
Set<String> wordSet = new HashSet<>();
并存储输入句子的每个单词。如果在插入 HashSet
期间已经存在任何单词,则不要为该单词调用 checkWord(String wordToCheck)
。像这样的 -
String[] words = // split input sentence into words
for(String word: words) {
if(!wordSet.contains(word)) {
checkWord(word);
// do stuff
wordSet.add(word);
}
}
编辑
// ....
{
user_text = cleanString;
String[] words = user_text.split(" ");
Set<String> wordSet = new HashSet<>();
int error = 0;
for (String word : words) {
// wordSet is another data-structure. Its only for duplicates checking, don't mix it with dictionary
if(!wordSet.contains(word)) {
// put all your logic here
wordSet.add(word);
}
}
if (error == 0) {
System.out.println("No mistakes found");
}
}
// ....
你还有其他错误,就像你将 String wordCheck
作为 checkWord
的参数传递并再次在 checkWord()
中重新声明它 String wordCheck;
这是不正确的。请检查其他部分。