Remove stop words from a string in ASP.NET C#

I'm having trouble writing code that removes stop words from a string. Here is my code:

String Review="The portfolio is fine except for the fact that the last movement of sonata #6 is missing. What should one expect?";

string[] arrStopword = new string[] {"a", "i", "it", "am", "at", "on", "in", "to", "too", "very","of", "from", "here", "even", "the", "but", "and", "is","my","them", "then", "this", "that", "than", "though", "so", "are"};
StringBuilder sbReview = new StringBuilder(Review);
foreach (string word in arrStopword)
{
    sbReview.Replace(word, "");
}
Label1.Text = sbReview.ToString();

When I run it, Label1.Text = "The portfolo s fne except for fct tht lst movement st #6 s mssng. Wht should e expect? "

I want it to return "portfolio fine except for fact last movement sonata #6 is missing. what should one expect?"

Does anyone know how to solve this?

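The problem is that you are matching substrings, not words: Replace strips "i" out of "fine" and "a" out of "fact" just as readily as it strips the standalone words. You need to split the original text into words, remove the stop words, and then join the rest back together.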

Try this:

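// Needs: using System; using System.Collections.Generic; using System.Linq;
List<string> words = Review.Split(' ').ToList();
foreach (string stopWord in arrStopword)
    // RemoveAll drops every occurrence (Remove drops only the first); ignoring case lets "The" match "the".
    words.RemoveAll(w => string.Equals(w, stopWord, StringComparison.OrdinalIgnoreCase));
string result = String.Join(" ", words);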

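The only problem I can see is that it doesn't handle punctuation well (a stop word followed directly by a comma or period will not match), but you get the general idea.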
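One possible way around the punctuation caveat is to compare each token against the stop list only after trimming punctuation from its ends. The following is a minimal sketch, not part of the original answer: the class name StopWordFilter, the method name StripStopWords, and the chosen punctuation set are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordFilter
{
    // Hypothetical helper (name and signature are assumptions, not from the thread).
    public static string StripStopWords(string text, IEnumerable<string> stopWords)
    {
        // Case-insensitive set so that "The" matches the stop word "the".
        var stopSet = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);

        // Punctuation ignored when comparing a token to the stop list (an assumed set).
        char[] punctuation = { '.', ',', ';', ':', '!', '?', '"', '(', ')' };

        // Keep a token unless its punctuation-trimmed form is a stop word.
        // Kept tokens retain their own punctuation ("missing." stays "missing.").
        var kept = text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(token => !stopSet.Contains(token.Trim(punctuation)));

        return string.Join(" ", kept);
    }
}
```

Calling StopWordFilter.StripStopWords(Review, arrStopword) on the string from the question should give "portfolio fine except for fact last movement sonata #6 missing. What should one expect?". One trade-off of this sketch: when a stop word carries punctuation (for example "is."), the whole token is dropped, punctuation included.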

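You can use LINQ to solve this. First use the Split function to turn the string into a list of strings separated by " " (space), then use Except to get the words your result should contain, and finally apply string.Join.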

var newString = string.Join(" ", Review.Split(' ').Except(arrStopword));
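One thing to keep in mind with the one-liner above: Enumerable.Except is a set operation, so it returns distinct elements and, with the default comparer, matches case-sensitively. Repeated non-stop words would therefore be collapsed to a single occurrence, and "The" would survive. A variant using Where with a case-insensitive HashSet avoids both effects. This is only a sketch; the names LinqStopWords and FilterWithWhere are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinqStopWords
{
    // Hypothetical helper (name is an assumption, not from the thread).
    public static string FilterWithWhere(string text, string[] stopWords)
    {
        // Case-insensitive lookup set built once from the stop list.
        var stopSet = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);

        // Where filters element by element, so duplicates of kept words are preserved.
        return string.Join(" ", text.Split(' ').Where(w => !stopSet.Contains(w)));
    }
}
```

For example, FilterWithWhere("The cat and the cat", new[] { "the", "and" }) keeps both occurrences of "cat", whereas the Except version would keep "The" and only one "cat".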

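You could use " a ", " I ", and so on (with a space on each side) to make sure the program only removes these when they are used as whole words. Simply replace each one with a single space to keep the formatting intact.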
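The space-padded replacement misses stop words at the very start or end of the string, next to punctuation, or in a different case. A regular expression with word boundaries (\b) expresses the same whole-word idea more robustly. The sketch below is an alternative technique, not what this answer literally proposes, and the names RegexStopWords and RemoveWholeWords are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexStopWords
{
    // Hypothetical helper (name is an assumption, not from the thread).
    public static string RemoveWholeWords(string text, string[] stopWords)
    {
        // Build a pattern like \b(?:a|i|it|...)\b; Regex.Escape guards against
        // stop words that happen to contain regex metacharacters.
        string alternatives = string.Join("|", stopWords.Select(w => Regex.Escape(w)));
        string pattern = @"\b(?:" + alternatives + @")\b";

        // Remove whole-word matches regardless of case.
        string stripped = Regex.Replace(text, pattern, "", RegexOptions.IgnoreCase);

        // Collapse the whitespace runs left behind and trim the ends.
        return Regex.Replace(stripped, @"\s+", " ").Trim();
    }
}
```

With the question's input and stop list this should yield "portfolio fine except for fact last movement sonata #6 missing. What should one expect?". Note that \b treats an apostrophe as a boundary, so a contraction such as "it's" would be reduced to "'s"; handling that case needs extra logic.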

Alternatively, you can use the dotnet-stop-words package. Just call the RemoveStopWords method:

(yourString).RemoveStopWords("en");