Delete stopwords from text file in C#
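I read two text files: the first contains Arabic text, which I split into words. The second contains stop words.
I want to remove every stop word (listed in the second file) from the first file, but I don't know how to do it: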
FileStream fs = new FileStream(@"H:\arabictext.txt", FileMode.Open);
StreamReader arab = new StreamReader(fs,Encoding.Default,true);
string artx = arab.ReadToEnd();
richTextBox1.Text = artx;
arab.Close();
char[] dele = {' ', ',', '.', '\t', ';','#','!' };
string[] words = richTextBox1.Text.Split(dele);
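// verbatim string (@): without it, "\a" is read as an escape sequence
FileStream fsw = new FileStream(@"H:\arab.txt", FileMode.Create);
StreamWriter arabw = new StreamWriter(fsw,Encoding.Default);
foreach (string s in words)
{
    arabw.WriteLine(s);
}
// close the writer so the buffered output is flushed to disk
arabw.Close();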
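If I have understood correctly, you want to take the stop words from the first file and delete them from the second file.
Here is my approach:
- Extract the stop words from the first file with the Split method
- Iterate over the words extracted from the first file and replace each one with String.Empty in the contents of the second file.
- Save the file
I simplified your code to the following: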
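// Requires: using System; using System.Linq;
// read file contents
var fileContent1 = System.IO.File.ReadAllText("file1.txt");
var fileContent2 = System.IO.File.ReadAllText("file2.txt");
// extract stop words from the first file, skipping empty entries
// (Replace throws an ArgumentException for an empty search string)
var words = fileContent1.Split(new char[] { ' ', ',', '.', '\t', ';', '#', '!' },
        StringSplitOptions.RemoveEmptyEntries)
    .Distinct();
// remove stop words from file2; strings are immutable, so keep the result of Replace
foreach (var word in words)
    fileContent2 = fileContent2.Replace(word, string.Empty);
System.IO.File.WriteAllText("file2.txt", fileContent2);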
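One caveat with Replace is that it also removes a stop word wherever it occurs inside a longer word. A minimal sketch of a token-based alternative follows, assuming the text and the stop-word list are both split on the delimiters from the question plus line breaks (the line breaks are an addition). The names StopWordFilter and RemoveStopWords are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordFilter
{
    // Delimiters from the question, plus '\r' and '\n' (an assumption,
    // so that a stop-word file with one word per line also works).
    private static readonly char[] Delimiters =
        { ' ', ',', '.', '\t', ';', '#', '!', '\r', '\n' };

    // Returns the words of 'text' that do not appear in 'stopWordText'.
    // Only whole tokens are compared, so substrings of longer words are kept.
    public static List<string> RemoveStopWords(string text, string stopWordText)
    {
        var stopWords = new HashSet<string>(
            stopWordText.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries),
            StringComparer.Ordinal);

        return text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries)
                   .Where(token => !stopWords.Contains(token))
                   .ToList();
    }
}
```

The returned list could then be written out with System.IO.File.WriteAllLines, one word per line, which matches the output format of the code in the question.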
I found a solution to my problem.
Do you have a better solution?
char[] dele = { ' ', ',', '.', '\t', ';', '#', '!' };
using (TextWriter tw = new StreamWriter(@"H:\output.txt"))
{
// verbatim string (@): without it, "\a" is read as an escape sequence
using (StreamReader reader = new StreamReader(@"H:\arabictext.txt",Encoding.Default,true))
{
string line;
while ((line = reader.ReadLine()) != null)
{
string[] stopWord = new string[] { "قد", "في", "بيت", "فواصل", "هي", "من","$","ُ","ِ","ُ","ّ","ٍ","ٌ","ْ","ً" };
foreach (string word in stopWord)
{
line = line.Replace(word, "");
}
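// WriteLine restores the line break that ReadLine strips,
// so the last word of one line does not merge with the first word of the next
tw.WriteLine(line);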
}
}
}
FileStream fs = new FileStream(@"H:\output.txt", FileMode.Open);
StreamReader arab = new StreamReader(fs,Encoding.Default,true);
string artx = arab.ReadToEnd();
arab.Close();
string[] words = artx.Split(dele);
// verbatim string (@): without it, "\r" is read as a carriage return
FileStream fsw = new FileStream(@"H:\result.txt", FileMode.Create);
StreamWriter arabw = new StreamWriter(fsw,Encoding.Default);
foreach (string s in words)
{
arabw.WriteLine(s);
}
arabw.Close();
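The stop-word list above also contains individual Arabic diacritic marks. As a rough sketch of another way to strip them, assuming that every character in the Unicode non-spacing mark category should be removed (which is broader than the marks listed above), the text could be filtered by character category instead of by repeated Replace calls. The names DiacriticStripper and StripMarks are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticStripper
{
    // Returns 'text' without characters whose Unicode category is
    // NonSpacingMark (the Arabic short-vowel marks fall in this category).
    public static string StripMarks(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

This would be applied to each line before the words are split and filtered, so that a word matches its stop-word entry whether or not it was written with vowel marks.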