Removing conjunctions from a .txt file using words in an array
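I'm trying to remove conjunctions and punctuation from a txt file. The punctuation is removed successfully, but some of the conjunctions remain. Here is my code: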
public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    private void Form1_Load(object sender, EventArgs e)
    {
        string words = File.ReadAllText(@"C:\Users\...\Desktop\data_protection_law.txt").ToLower(new CultureInfo("en-US", false));
        string[] punctuation = { ".", "!", "?", "–", "-", "-", "/", "_", ",", ";", ":", "(", ")", "[", "]", "“", "”", "\"", "1", "2", "3", "4", "5", "6", "7", "8", "9" };
        string[] con_art = { "the", "a", "an", "for", "and", "or", "nor", "but", "yet", "so", "of", "to", "in", "are", "is", "on", "be", "by", "we", "he", "that", "he", "that", "because", "as", "it", "about", "were", "i", "our", "they", "with", "these", "there", "then", "them" };

        foreach (string s in punctuation)
        {
            words = words.Replace(s, "");
        }

        foreach (string s in con_art)
        {
            words = words.Replace(" " + s + " ", " ");
        }

        richTextBox1.Text = words;
    }
}
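To be sure, I printed the text in a richTextBox. When I compared it with the original text, I found that some of the conjunctions had been removed, but not all of them.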
Here is the proof of the remaining conjunctions
Original Text File
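I'm going crazy. I've been looking for the mistake on my own for days, but I can't find it.
So where is the mistake in this code?
By the way, I'm just a beginner, so please don't be mad if I made a big mistake :)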
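Because sometimes the words are not surrounded by a space on both sides.
The replacements that failed are all at the start or end of a line, which means a line break is not a space.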
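To make the comment above concrete, here is a minimal sketch of the same space-delimited replacement applied to a short sample. The class name, method name, and sample text are made up for illustration; only the `" " + word + " "` replacement comes from the question.

```csharp
using System;

public static class SpaceReplaceDemo
{
    // Same approach as in the question: the word is only matched when it
    // has a literal space character on both sides.
    public static string RemoveWithSpaces(string text, string word)
    {
        return text.Replace(" " + word + " ", " ");
    }

    public static void Main()
    {
        // Hypothetical sample: "the" appears at the very start, in the
        // middle of a line, and right before a line break.
        string text = "the law protects the data of the\nuser";

        // Only the middle occurrence is surrounded by two spaces, so it is
        // the only one that gets removed.
        Console.WriteLine(RemoveWithSpaces(text, "the"));
    }
}
```

The first "the" has no space before it and the last one is followed by `\n` rather than a space, so both survive.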
I think you need to change your search-and-replace approach entirely; using a regular expression is the simplest option here:
var rex = string.Join("|", con_art.Select(w => $@"\b{w}\b"));
words = Regex.Replace(words, rex, "", RegexOptions.IgnoreCase);
The first line of code turns your list of words into a string like:
\bthe\b|\ba\b|\ban\b|\bfor\b|\band\b|\bor\b|...
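When used by the regex engine, \b means "a boundary between a non-word character (a space, punctuation, a line break and so on) and a word character (a letter, a digit and so on)"; it also matches at the very start or end of the text when a word character sits there. This effectively makes the search for the, a, an, for, and and so on work as "whole word only", which is what you were trying to achieve with your spaces (and which doesn't work, because sometimes your words aren't surrounded by spaces).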
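As a rough illustration of that whole-word behaviour, the sketch below removes a single word with `\b` on both sides. The helper name and the sample text are assumptions for the demo, and the `Regex.Escape` call is an extra precaution that the answer's code does not include.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Removes every whole-word occurrence of `word`. Regex.Escape is an
    // added safeguard in case a word ever contains regex metacharacters.
    public static string RemoveWholeWord(string text, string word)
    {
        return Regex.Replace(text, $@"\b{Regex.Escape(word)}\b", "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Hypothetical sample: "the" at the start of the text, "the" as a
        // prefix of "theory", and "the" alone on the last line.
        string text = "the theory\nthe";

        // The first and last "the" are removed; "theory" is left intact
        // because there is no boundary between "the" and "ory".
        Console.WriteLine(RemoveWholeWord(text, "the"));
    }
}
```

Note that the spaces and line breaks around a removed word stay where they were, since `\b` matches a position and does not consume any characters.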
The pipe | means "or"; by supplying a list of "whole word 'the' or whole word 'a' or whole word 'an' ...", you don't have to call Replace() over and over in a loop.
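Putting the pieces together, a self-contained version might look like the sketch below. `StopWordFilter`, `BuildPattern`, and `RemoveWords` are hypothetical names. The whitespace clean-up at the end is an assumption about the desired output, not something the answer prescribes; it is there because replacing a word with an empty string leaves its surrounding spaces behind.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopWordFilter
{
    // Builds the alternation described above, e.g. \bthe\b|\ba\b|\ban\b
    public static string BuildPattern(string[] words)
    {
        return string.Join("|", words.Select(w => $@"\b{Regex.Escape(w)}\b"));
    }

    // Removes all listed words in a single pass instead of one Replace per word.
    public static string RemoveWords(string text, string[] words)
    {
        string stripped = Regex.Replace(text, BuildPattern(words), "", RegexOptions.IgnoreCase);

        // Assumption: runs of spaces or tabs left behind by the removals
        // should be collapsed to a single space. Line breaks are kept.
        return Regex.Replace(stripped, @"[ \t]{2,}", " ").Trim();
    }

    public static void Main()
    {
        // Shortened, hypothetical word list and sample text.
        string[] conArt = { "the", "a", "and", "are", "in" };
        string text = "the cat and a dog\nare in the house";

        Console.WriteLine(BuildPattern(conArt));
        Console.WriteLine(RemoveWords(text, conArt));
    }
}
```

With the word list from the question, `RemoveWords(words, con_art)` would take the place of the second `foreach` loop.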