Why aren't the stopwords being filtered in my program?

Question

I mainly use NLTK's stopword list, as the code below shows:

from nltk.corpus import stopwords
stopword_nltk=stopwords.words('french')
motoutil=['après', 'avant', 'avex', 'chez', '\ba\b', 'et', 'concernant', 'contre', 'dans', 'depuis', 'derrière', 'dès', 'devant', 'durant', 'en', 'entre', 'envers', 'hormis', 'hors', 'jusque', 'malgré', 'moyennant', 'nonobstant', 'outre', 'par', 'parmi pendant', 'pour', 'près', 'sans', 'sauf', 'selon', 'sous', 'suivant', 'sur', 'touchant', 'vers', 'via', 'tout','tous', 'toute', 'toutes', 'jusqu']
stopwords_list=stopword_nltk+motoutil

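The program does not fail because I appended another list to stopword_nltk: even if I remove motoutil, it still does not work.

This is the part where I intend to remove the stopwords: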

for line in f_in.readlines():
    new_line=re.sub('\W',' ', line.lower())
    list_word=new_line.split()
    for element in list_word:
        if element in stopwords_list:
            cleaned_line=re.sub(element, ' ', new_line)
            f_out_trameur.write(cleaned_line)
            f_out_cleaned.write(cleaned_line)

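It has two problems:

First, not all of the listed stopwords are removed, for example 'et'.

Second, I also want to remove the words 'de' and 'ce', but not the same letters when they occur inside another word. For example, in the excerpt "madame monsieur le président de l'assemblée nationale", the standalone de next to président should be removed, but the "de" inside président should not. In my actual script, président ends up as "prési nt".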

Answer 1

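It looks like you are creating and writing the cleaned line inside the inner loop, the one that iterates over the tokens produced by new_line.split(). And if nothing needs cleaning, the line is never written at all?

This produces multiple versions of every line that contains stopwords (each version with only one stopword removed), while lines that contain no stopword are skipped entirely. That would explain why 'et' still shows up in the output: it is removed in only one of the copies. And because re.sub(element, ' ', new_line) replaces the pattern wherever it occurs, not only where it is a whole word, removing 'de' also turns président into "prési nt".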
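
The behaviour described above can be reproduced without NLTK. This is only a minimal sketch: the three-word stopwords_list is a hypothetical stand-in for the real list, the sample sentence is the excerpt from the question extended with "et le sénat" for illustration, and the written list stands in for the output files.

```python
import re

# Hypothetical stand-ins: a tiny stopword list and one sample line.
stopwords_list = ["le", "de", "et"]
line = "madame monsieur le président de l'assemblée nationale et le sénat"

new_line = re.sub(r"\W", " ", line.lower())

written = []  # stands in for f_out_trameur / f_out_cleaned
for element in new_line.split():
    if element in stopwords_list:
        # Same call as in the question: replaces the substring everywhere,
        # and always starts again from the uncleaned new_line.
        written.append(re.sub(element, " ", new_line))

for version in written:
    print(repr(version))
```

One copy of the line is written per stopword token found, each copy is missing only that one stopword, and the copy produced for 'de' also damages président.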

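My suggestion: since you already have the tokens (you called split()), build the new line from those tokens instead of substituting them inside new_line.

This also lets you convert the stopword list into a set, which makes the check if element in stopwords_list faster. A stopword list is usually fairly long, and scanning it for every word of a large text can become slow. With NLTK stopwords this is almost always a good way to speed things up.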
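
As a rough illustration of the list-versus-set point, the sketch below filters the same tokens against both containers and times them with timeit. The stopword entries are hypothetical filler ("mot0", "mot1", ...) standing in for stopword_nltk + motoutil, and the timings will vary from machine to machine.

```python
import timeit

# Hypothetical stand-in for stopword_nltk + motoutil.
stopwords_list = [f"mot{i}" for i in range(500)] + ["de", "ce", "et"]
stopwords_set = set(stopwords_list)

tokens = "madame monsieur le président de l assemblée nationale".split()

# Both containers give the same answer; only the lookup cost differs.
kept_from_list = [w for w in tokens if w not in stopwords_list]
kept_from_set = [w for w in tokens if w not in stopwords_set]

list_time = timeit.timeit(
    lambda: [w for w in tokens if w not in stopwords_list], number=2000
)
set_time = timeit.timeit(
    lambda: [w for w in tokens if w not in stopwords_set], number=2000
)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

A list lookup scans the entries one by one, while a set lookup is a hash lookup, so the gap widens as the stopword list grows.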

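I would also suggest a list comprehension, to avoid too many nested loops and conditionals and to make the code more readable, but that is just personal preference.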

import re
from nltk.corpus import stopwords

stopword_nltk = stopwords.words('french')
# Extra stopwords. Tokens are compared whole, so plain 'a' is enough here
# (no regex word boundary is needed); 'parmi' and 'pendant' are separate entries.
motoutil = ['après', 'avant', 'avec', 'chez', 'a', 'et', 'concernant', 'contre', 'dans', 'depuis', 'derrière', 'dès', 'devant', 'durant', 'en', 'entre', 'envers', 'hormis', 'hors', 'jusque', 'malgré', 'moyennant', 'nonobstant', 'outre', 'par', 'parmi', 'pendant', 'pour', 'près', 'sans', 'sauf', 'selon', 'sous', 'suivant', 'sur', 'touchant', 'vers', 'via', 'tout', 'tous', 'toute', 'toutes', 'jusqu']
# A set gives fast membership tests
stopwords_set = set(stopword_nltk + motoutil)

# f_in, f_out_trameur and f_out_cleaned are the file objects opened elsewhere in your script
for line in f_in:
    new_line = re.sub(r'\W', ' ', line.lower())
    # Keep only the tokens that are not stopwords
    list_word = [word for word in new_line.split() if word not in stopwords_set]
    cleaned_line = ' '.join(list_word)
    f_out_trameur.write(cleaned_line)
    f_out_cleaned.write(cleaned_line)

Note that write() does not append a newline \n, so you may need to add one yourself (f_out_trameur.write(cleaned_line + '\n') and f_out_cleaned.write(cleaned_line + '\n')), depending on how you want the output files to look.
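
To see the token-based approach and the newline handling together, here is a small self-contained sketch. It makes several assumptions: io.StringIO objects stand in for the real input and output files, the five-word stopwords_set is a hypothetical replacement for the NLTK-based set, and clean_line is a helper name introduced only for this example.

```python
import io
import re


def clean_line(line, stopwords):
    """Lowercase the line, split it into word tokens, drop the stopwords."""
    tokens = re.sub(r"\W", " ", line.lower()).split()
    return " ".join(token for token in tokens if token not in stopwords)


# Hypothetical stand-ins for the real stopword set and files.
stopwords_set = {"le", "de", "ce", "et", "l"}
f_in = io.StringIO(
    "Madame, Monsieur le président de l'Assemblée nationale\n"
    "Ce texte et cet exemple\n"
)
f_out = io.StringIO()

for line in f_in:
    # Exactly one output line per input line, with the newline added back.
    f_out.write(clean_line(line, stopwords_set) + "\n")

result = f_out.getvalue()
print(result)
```

Because whole tokens are compared, président and cet survive even though they contain "de" and "ce".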

python

nltk

stop-words

python-re