I don't want to remove stop words by splitting words into letters
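I am writing this code to remove stop words from my text.
Problem: the code removes stop words correctly, but it breaks when my text contains words like ant or ide. It removes ant and ide as well, because ant is present in important and ide is present in side. I don't want words to be split into letters when removing stop words.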
String sCurrentLine;
List<String> stopWordsofwordnet=new ArrayList<>();
FileReader fr=new FileReader("G:\\stopwords.txt");
BufferedReader br= new BufferedReader(fr);
while ((sCurrentLine = br.readLine()) != null)
{
    stopWordsofwordnet.add(sCurrentLine);
}
//out.println("<br>"+stopWordsofwordnet);
List<String> wordsList = new ArrayList<>();
String text = request.getParameter("textblock");
text=text.trim().replaceAll("[\\s,;]+", " ");
String[] words = text.split(" ");
// wordsList.addAll(Arrays.asList(words));
for (String word : words) {
    wordsList.add(word);
}
out.println("<br>");
//remove stop words here from the temp list
for (int i = 0; i < wordsList.size(); i++)
{
    // get the item as string
    for (int j = 0; j < stopWordsofwordnet.size(); j++)
    {
        if (stopWordsofwordnet.get(j).contains(wordsList.get(i).toLowerCase()))
        {
            out.println(wordsList.get(i)+" ");
            wordsList.remove(i);
            i--;
            break;
        }
    }
}
out.println("<br>");
for (String str : wordsList) {
    out.print(str+" ");
}
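Your code is overly complicated, and contains() matches substrings, which is why ant is removed when important is a stop word. It can be reduced to the following, which compares whole words only: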
// Load stop words from file
Set<String> stopWords = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
stopWords.addAll(Files.readAllLines(Paths.get("G:\\stopwords.txt")));
// Get text and split into words
String text = request.getParameter("textblock");
List<String> wordsList = new ArrayList<>(Arrays.asList(
    text.replaceAll("[\\s,;]+", " ").trim().split(" ")));
// Remove stop words from list of words
wordsList.removeAll(stopWords);
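To see the whole-word comparison on its own, here is a minimal, self-contained sketch that does not depend on the servlet request or the stop-word file. The class name StopWordFilter, the helper removeStopWords and the sample stop-word list are assumptions made for illustration; they are not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class StopWordFilter {
    // Hypothetical helper: keeps every token that is not, as a whole word,
    // equal to a stop word (ignoring case). Substrings are never matched.
    static List<String> removeStopWords(String text, Collection<String> stopWords) {
        Set<String> lookup = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        lookup.addAll(stopWords);
        List<String> kept = new ArrayList<>();
        // Split on runs of whitespace, commas and semicolons, as in the question
        for (String word : text.trim().split("[\\s,;]+")) {
            if (!word.isEmpty() && !lookup.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample stop words (assumed); "important" and "side" are included
        // because they are the ones that caused ant and ide to be removed
        List<String> stopWords = Arrays.asList("important", "side", "the", "is");
        System.out.println(removeStopWords("The ant is on the side of the IDE", stopWords));
        // prints [ant, on, of, IDE]
    }
}
```

Because the lookup set tests whole tokens, ant and IDE survive even though important and side are stop words, while The, is and side are removed regardless of case.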