How to stop a Java spell checker program from correcting repeated words

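I have implemented a program that does the following:

  1. Scans all the words on a web page into a string (using jsoup)
  2. Filters out all HTML markup and code
  3. Passes the words to a spell checker, which offers suggestions

The spell checker loads a dictionary.txt file into an array and compares the input string against the words in the dictionary.

My problem is that when the input contains the same word more than once, for example "teh program is teh worst", the code prints: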

You entered 'teh', did you mean 'the'?
You entered 'teh', did you mean 'the'?

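Sometimes a website repeats several words over and over, and the output gets messy.

Ideally the program would print each word together with the number of times it was misspelled, but limiting the output to one line per word would be enough.

My program has several methods and two classes, but the spell-checking method is shown below.

Note: the original code contains some 'if' statements that strip punctuation, but I have removed them for clarity.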

static boolean suggestWord;

public static String checkWord(String wordToCheck) {
    String wordCheck;
    String word = wordToCheck.toLowerCase();

    if ((wordCheck = (String) dictionary.get(word)) != null) {
        suggestWord = false; // no need to ask for suggestion for a correct
                             // word.
        return wordCheck;
    }

    // If after all of these checks a word could not be corrected, return as
    // a misspelled word.
    return word;
}

Edit: as requested, the full code:

Class 1:

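import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Hashtable;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

public class ParseCleanCheck {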

        static Hashtable<String, String> dictionary;// To store all the  words of the
        // dictionary
        static boolean suggestWord;// To indicate whether the word is spelled
                                    // correctly or not.

        static Scanner urlInput = new Scanner(System.in);
        public static String cleanString;
        public static String url = "";
        public static boolean correct = true;


        /**
         * PARSER METHOD
         */
        public static void PageScanner() throws IOException {
            System.out.println("Pick an english website to scan.");

            // This do-while loop allows the user to try again after a mistake
            do {
                try {
                    System.out.println("Enter a URL, starting with http://");
                    url = urlInput.nextLine();
                    // This creates a document out of the HTML on the web page
                    Document doc = Jsoup.connect(url).get();
                    // This converts the document into a string to be cleaned
                    String htmlToClean = doc.toString();
                    cleanString = Jsoup.clean(htmlToClean, Whitelist.none());


                    correct = false;
                } catch (Exception e) {
                    System.out.println("Incorrect format for a URL. Please try again.");
                }
            } while (correct);
        }

        /**
         * SPELL CHECKER METHOD
         */
        public static void SpellChecker() throws IOException {
            dictionary = new Hashtable<String, String>();
            System.out.println("Searching for spelling errors ... ");

            try {
                // Read and store the words of the dictionary
                BufferedReader dictReader = new BufferedReader(new FileReader("dictionary.txt"));

                while (dictReader.ready()) {
                    String dictInput = dictReader.readLine();
                    String[] dict = dictInput.split("\\s"); // create an array of
                                                             // dictionary words

                    for (int i = 0; i < dict.length; i++) {
                        // key and value are identical
                        dictionary.put(dict[i], dict[i]);
                    }
                }
                dictReader.close();
                String user_text = "";

                // Initializing a spelling suggestion object based on probability
                SuggestSpelling suggest = new SuggestSpelling("wordprobabilityDatabase.txt");

                // get user input for correction
                {

                    user_text = cleanString;
                    String[] words = user_text.split(" ");

                    int error = 0;

                    for (String word : words) {
                        if(!dictionary.contains(word)) {
                            checkWord(word);


                            dictionary.put(word, word);
                        }
                        suggestWord = true;
                        String outputWord = checkWord(word);

                        if (suggestWord) {
                            System.out.println("Suggestions for " + word + " are:  " + suggest.correct(outputWord) + "\n");
                            error++;
                        }
                    }

                    if (error == 0) {
                        System.out.println("No mistakes found");
                    }
                }

            } catch (IOException e) {
                e.printStackTrace();
                System.exit(-1);
            }
        }

        /**
         * METHOD TO SPELL CHECK THE WORDS IN A STRING. IS USED IN SPELL CHECKER
         * METHOD THROUGH THE "WORD" STRING
         */

        public static String checkWord(String wordToCheck) {
            String wordCheck;
            String word = wordToCheck.toLowerCase();

            if ((wordCheck = (String) dictionary.get(word)) != null) {
                suggestWord = false; // no need to ask for suggestion for a correct
                                     // word.
                return wordCheck;
            }

            // If after all of these checks a word could not be corrected, return as
            // a misspelled word.
            return word;
        }
}

There is also a second class (SuggestSpelling.java) that contains a probability calculator, but it is not relevant here unless you plan to run the code yourself.

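Use a HashSet to detect duplicates:

Set<String> wordSet = new HashSet<>();

and store each word of the input sentence in it. If a word is already in the HashSet when you come to insert it, do not call checkWord(String wordToCheck) for that word. Something like this: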

String[] words = user_text.split(" "); // split the input sentence into words
for(String word: words) {
    if(!wordSet.contains(word)) {
        checkWord(word);
        // do stuff
        wordSet.add(word);
    }
}
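
As a rough, self-contained illustration of the same idea, the sketch below reports each unknown word once. The class name DedupDemo, the method uniqueMisspellings, and the four-word dictionary are made up for this example and are not part of the original program. Lowercasing before the duplicate check is also an assumption, chosen so that "Teh" and "teh" count as the same word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DedupDemo {

    // Returns every word that is missing from the dictionary, once each,
    // in order of first appearance. (Hypothetical helper for illustration.)
    static List<String> uniqueMisspellings(String text, Set<String> dictionary) {
        Set<String> seen = new HashSet<>();
        List<String> misspelled = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String word = raw.toLowerCase();
            if (word.isEmpty()) {
                continue; // split() can yield an empty first token
            }
            // Set.add() returns false when the word was already present,
            // so repeated words are skipped here.
            if (seen.add(word) && !dictionary.contains(word)) {
                misspelled.add(word);
            }
        }
        return misspelled;
    }

    public static void main(String[] args) {
        // Tiny stand-in dictionary, assumed for the demo only.
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("the", "program", "is", "worst"));
        for (String word : uniqueMisspellings("teh program is teh worst", dictionary)) {
            System.out.println("You entered '" + word + "'"); // printed once for 'teh'
        }
    }
}
```

In the real program the call to suggest.correct(...) would go where the word is added to the result list.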

Edit

// ....
{

    user_text = cleanString;
    String[] words = user_text.split(" ");
    Set<String> wordSet = new HashSet<>();

    int error = 0;

    for (String word : words) {
        // wordSet is a separate data structure used only for duplicate checking; don't mix it up with dictionary
        if(!wordSet.contains(word)) {

            // put all your logic here

            wordSet.add(word);
        }
    }

    if (error == 0) {
        System.out.println("No mistakes found");
    }
}
// .... 

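You also have other errors. For example, you pass String wordCheck as the parameter of checkWord and then declare String wordCheck; again inside checkWord(), which is not correct. Please check the other parts as well.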
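
The question also mentions that printing each word with the number of times it was misspelled would be ideal. One possible way to get that, sketched here under assumptions, is to replace the Set with a Map from word to count. The class MisspellingCounter, the method countMisspellings, and the small dictionary are hypothetical names for this example; a LinkedHashMap is used only so that words come out in the order they first appear.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class MisspellingCounter {

    // Maps each word that is missing from the dictionary to the number of
    // times it occurs in the text. (Hypothetical helper for illustration.)
    static Map<String, Integer> countMisspellings(String text, Set<String> dictionary) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String raw : text.split("\\s+")) {
            String word = raw.toLowerCase();
            if (word.isEmpty() || dictionary.contains(word)) {
                continue; // skip empty tokens and correctly spelled words
            }
            // merge() stores 1 for a new word, otherwise adds 1 to the old count
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Tiny stand-in dictionary, assumed for the demo only.
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("the", "program", "is", "worst"));
        Map<String, Integer> counts =
                countMisspellings("teh program is teh worst", dictionary);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue()); // prints "teh: 2"
        }
    }
}
```

With this shape the suggestion lookup runs once per distinct word, after counting, rather than once per occurrence.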