Removing conjunctions from a .txt file using words in an array

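I am trying to remove conjunctions and punctuation from a txt file. The punctuation is removed successfully, but some of the conjunctions remain. Here is my code: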

using System;
using System.Globalization;
using System.IO;
using System.Windows.Forms;

public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    private void Form1_Load(object sender, EventArgs e)
    {
        // Read the whole file and lower-case it so the word list only needs lower-case entries
        string words = File.ReadAllText(@"C:\Users\...\Desktop\data_protection_law.txt").ToLower(new CultureInfo("en-US", false));

        string[] punctuation = { ".", "!", "?", "–", "-", "-", "/", "_", ",", ";", ":", "(", ")", "[", "]", "“", "”", "\"", "1", "2", "3", "4", "5", "6", "7", "8", "9" };
        string[] con_art = { "the", "a", "an", "for", "and", "or", "nor", "but", "yet", "so", "of", "to", "in", "are", "is", "on", "be", "by", "we", "he", "that", "he", "that", "because", "as", "it", "about", "were", "i", "our", "they", "with", "these", "there", "then", "them" };

        // Strip every punctuation mark and digit
        foreach (string s in punctuation)
        {
            words = words.Replace(s, "");
        }

        // Replace each listed word, when it sits between two spaces, with a single space
        foreach (string s in con_art)
        {
            words = words.Replace(" " + s + " ", " ");
        }

        richTextBox1.Text = words;
    }
}

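To be sure, I printed the text in a richTextBox. When I compared it with the original text, I saw that some of the conjunctions had been removed, but not all of them. Here is the proof of the remaining conjunctions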

Original Text File

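I am going crazy. I have been looking for the mistake on my own for days, but I can't find it.

So where is the mistake in this code? By the way, I am just a beginner, so please don't be angry if I have made a big mistake :)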

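Because sometimes the words are not surrounded by a space on both sides.

The replacements that fail are all at the start or end of a line, which means a line break is not a space.

I think you need to change your search-and-replace approach entirely; using a regular expression is the simplest option here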
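To make that concrete, here is a minimal sketch that reproduces the problem. The sample text, the class name `SpaceReplaceDemo`, and the helper `RemoveWithSpaces` are made up for illustration; only the `Replace(" " + s + " ", " ")` call comes from the question.

```csharp
using System;

public static class SpaceReplaceDemo
{
    // Hypothetical helper that mirrors the loop from the question:
    // a word is only removed when it has a space on BOTH sides.
    public static string RemoveWithSpaces(string text, string[] stopWords)
    {
        foreach (string s in stopWords)
        {
            text = text.Replace(" " + s + " ", " ");
        }
        return text;
    }

    public static void Main()
    {
        // "the" starts line 1, "and" ends line 1, "a" starts line 2;
        // only the second "the" is surrounded by spaces.
        string sample = "the law and\na rule for the people";
        Console.WriteLine(RemoveWithSpaces(sample, new[] { "the", "and", "a" }));
    }
}
```

Only the "the" in the middle of the second line is removed. The words touching the start of the text or a line break stay, because `"\n"` is not `" "`.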

var rex = string.Join("|", con_art.Select(w => $@"\b{w}\b"));
words = Regex.Replace(words, rex, "", RegexOptions.IgnoreCase);

The first line of code converts your word list into a string like

\bthe\b|\ba\b|\ban\b|\bfor\b|\band\b|\bor\b|...

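When used by the regex engine, \b means "the boundary between a non-word character (a space, punctuation, a line break, etc.) and a word character (a letter, a digit, etc.)"; it also matches at the very start and end of the text. This effectively makes the search for the, a, an, for, and, etc. work as "whole word only", which is what you were trying to achieve with your spaces (and which doesn't work, because sometimes your words aren't surrounded by spaces).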

The pipe | means "or"; by supplying a list of "whole word 'the' OR whole word 'a' OR whole word 'an' ..." you no longer have to call Replace() over and over in a loop
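Putting the pieces together, a self-contained sketch might look like the following. The class `StopWordFilter`, the method `RemoveWords`, and the sample text are hypothetical names used only for illustration. Two things go beyond the two lines above and are my own assumptions about what you probably want: `Regex.Escape` protects against list entries that contain regex metacharacters, and a second `Regex.Replace` collapses the runs of spaces left behind once the words are gone.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopWordFilter
{
    // Hypothetical helper, not part of the original code.
    public static string RemoveWords(string text, string[] stopWords)
    {
        // Build \bthe\b|\ba\b|... ; Regex.Escape is an added safety measure.
        string pattern = string.Join("|", stopWords.Select(w => $@"\b{Regex.Escape(w)}\b"));

        // Remove every whole-word match in a single pass.
        string stripped = Regex.Replace(text, pattern, "", RegexOptions.IgnoreCase);

        // Assumption: collapse runs of spaces/tabs left behind; line breaks are kept.
        return Regex.Replace(stripped, @"[ \t]{2,}", " ");
    }

    public static void Main()
    {
        string sample = "the law and\na rule for the people";
        Console.WriteLine(RemoveWords(sample, new[] { "the", "a", "and", "for" }));
    }
}
```

Note that "law" keeps its "a": there is no word boundary between "l", "a" and "w", so \ba\b does not match inside it. A single leading or trailing space can still remain on a line; call `Trim()` per line if that matters.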