I don't want to remove stop words by splitting words into letters

Question

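I am writing this code to remove stop words from my text.

Problem: this code works fine for removing stop words, but the trouble starts when my text contains words such as ant or ide. It removes ant and ide as well, because ant occurs in important and want, and ide occurs in side. I don't want to split words into letters in order to remove stop words.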

    String sCurrentLine;
    List<String> stopWordsofwordnet = new ArrayList<>();
    FileReader fr = new FileReader("G:\\stopwords.txt");
    BufferedReader br = new BufferedReader(fr);
    while ((sCurrentLine = br.readLine()) != null) {
        stopWordsofwordnet.add(sCurrentLine);
    }
    //out.println("<br>"+stopWordsofwordnet);
    List<String> wordsList = new ArrayList<>();

    String text = request.getParameter("textblock");
    text = text.trim().replaceAll("[\\s,;]+", " ");
    String[] words = text.split(" ");

    // wordsList.addAll(Arrays.asList(words));
    for (String word : words) {
        wordsList.add(word);
    }
    out.println("<br>");

    // remove stop words here from the temp list
    for (int i = 0; i < wordsList.size(); i++) {
        // get the item as string
        for (int j = 0; j < stopWordsofwordnet.size(); j++) {
            if (stopWordsofwordnet.get(j).contains(wordsList.get(i).toLowerCase())) {
                out.println(wordsList.get(i) + "&nbsp;");
                wordsList.remove(i);
                i--;
                break;
            }
        }
    }
    out.println("<br>");
    for (String str : wordsList) {
        out.print(str + " ");
    }
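
The symptom described above appears to come from the `contains` call in the inner loop, which is a substring test rather than a whole-word comparison. The following is a minimal sketch of the difference. The class name, the helper names, and the three sample stop words are made up for illustration and are not taken from the actual `stopwords.txt`.

```java
import java.util.Arrays;
import java.util.List;

class SubstringVsWholeWord {
    // Same test as the question's inner loop: true if ANY stop word
    // contains the (lower-cased) word as a substring.
    static boolean matchesBySubstring(List<String> stopWords, String word) {
        for (String stop : stopWords) {
            if (stop.contains(word.toLowerCase())) {
                return true;
            }
        }
        return false;
    }

    // Whole-word test: true only if the word equals a stop word, ignoring case.
    static boolean matchesWholeWord(List<String> stopWords, String word) {
        for (String stop : stopWords) {
            if (stop.equalsIgnoreCase(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical stop word list, chosen to reproduce the symptom
        List<String> stopWords = Arrays.asList("important", "want", "side");

        System.out.println(matchesBySubstring(stopWords, "ant")); // true
        System.out.println(matchesWholeWord(stopWords, "ant"));   // false
        System.out.println(matchesBySubstring(stopWords, "ide")); // true
        System.out.println(matchesWholeWord(stopWords, "side"));  // true
    }
}
```

With the substring test, `ant` counts as a stop word because `important` contains it. With the whole-word test, only exact (case-insensitive) matches are removed.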

Answer 1

Your code is more complicated than it needs to be. It can be simplified to:

// Load stop words from file (case-insensitive lookup)
Set<String> stopWords = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
stopWords.addAll(Files.readAllLines(Paths.get("G:\\stopwords.txt")));

// Get text and split into words
String text = request.getParameter("textblock");
List<String> wordsList = new ArrayList<>(Arrays.asList(
        text.replaceAll("[\\s,;]+", " ").trim().split(" ")));

// Remove stop words from list of words (whole-word matches only)
wordsList.removeAll(stopWords);
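
To try the same idea outside a servlet, here is a self-contained sketch. It swaps the file and the request parameter for in-memory values. The class name `StopWordFilter`, the method `removeStopWords`, and the sample stop words are assumptions made for this example, not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class StopWordFilter {
    // Returns the words of 'text' with every whole-word, case-insensitive
    // match against 'stopWordList' removed.
    static List<String> removeStopWords(String text, Collection<String> stopWordList) {
        // A TreeSet with a case-insensitive comparator treats "The" and "the" as equal
        Set<String> stopWords = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        stopWords.addAll(stopWordList);

        // Collapse whitespace, commas and semicolons into single spaces, then split
        List<String> words = new ArrayList<>(Arrays.asList(
                text.replaceAll("[\\s,;]+", " ").trim().split(" ")));

        // ArrayList.removeAll asks the set whether it contains each word,
        // so only whole words are compared, never substrings
        words.removeAll(stopWords);
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical stop words, including the ones that caused trouble
        List<String> stopWords = Arrays.asList(
                "an", "is", "on", "the", "of", "important", "want", "side");

        System.out.println(removeStopWords("An ant is on the side of the IDE", stopWords));
        // prints [ant, IDE]
    }
}
```

Here `ant` and `IDE` survive even though `important`, `want` and `side` are stop words, because membership in the set is decided per whole word.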

java

nlp

servlets

stanford-nlp