Remove Stopwords in French AND English in TfidfVectorizer

Question

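I am trying to remove stop words in French and English in TfidfVectorizer. So far, I have only managed to remove English stop words. When I pass 'french' as the stop_words value, I get an error saying that it is not a built-in list.

Specifically, I get the following error message: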

ValueError: not a built-in stop list: french

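I have a text document containing 700 lines of mixed French and English text.

I am doing a clustering project on these 700 lines in Python. However, there is a problem with my clusters: they are full of French stop words, which hurts the quality of the clustering.

My question is the following:

Is there any way to add French stop words, or to manually update the built-in English stop word list, so that I can get rid of these unnecessary words?

Here is the TfidfVectorizer code that contains my stop words setting: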

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1,3))

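Removing these French stop words would give me clusters that represent the words that actually recur in my documents.

In case there is any doubt about the relevance of this question: I asked a similar question last week. However, it is not the same, because it did not use TfidfVectorizer.

Any help would be greatly appreciated. Thank you.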

Answer 1

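In my experience, the easiest way to work around this issue is to remove the stop words manually at the preprocessing stage (taking a list of the most common French phrases from elsewhere).

It is also convenient to check which stop words occur most often in your text/model, in both English and French (by their occurrence count or idf), and to add them to the stop words you exclude at the preprocessing stage.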
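A minimal sketch of that check, using only the standard library. It computes the share of documents each token appears in and treats tokens above a threshold as stop-word candidates. The sample lines, the 0.5 threshold, and the function name `document_frequencies` are illustrative assumptions, not part of the original setup.

```python
import re
from collections import Counter


def document_frequencies(docs):
    """Return a dict mapping each lowercase token to the share of documents containing it."""
    counts = Counter()
    for doc in docs:
        # Use a set so each token is counted once per document.
        tokens = set(re.findall(r"\b\w\w+\b", doc.lower()))
        counts.update(tokens)
    n_docs = len(docs)
    return {token: count / n_docs for token, count in counts.items()}


# Hypothetical sample lines standing in for the real 700-line corpus.
docs = [
    "le chat est sur le tapis",
    "the cat is on the mat",
    "le chien est dans le jardin",
    "the dog is in the garden",
]

df = document_frequencies(docs)
# Tokens present in at least half of the documents are stop-word candidates.
candidates = sorted(token for token, share in df.items() if share >= 0.5)
print(candidates)
```

The candidates still need a manual review before being excluded, since frequent words can also be meaningful for the topic.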

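If you prefer to remove the words with TfidfVectorizer's built-in mechanism, consider building a single list of the French and English stop words you want to exclude and passing it as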

stopwords = ['a', 'he', 'she', 'le', ...]  # '...' stands for the rest of your list
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words=stopwords, analyzer='word',
                                   use_idf=True, tokenizer=tokenize_and_stem)

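Importantly, quoting the documentation:

'english' is currently the only supported string value

So, for now, you will have to add a stop word list manually. You can find such lists in many places on the web and then adjust them to your topic, for example: stopwords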
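One way to "update" the built-in English list, sketched under the assumption that scikit-learn is installed: take its `ENGLISH_STOP_WORDS` set and union it with your own French words. The `french_extra` list here is a deliberately tiny, hypothetical example and the two documents are made up for the demonstration.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical, deliberately short French list; replace it with a fuller one.
french_extra = ["le", "la", "les", "et", "un", "une", "de", "des"]

# ENGLISH_STOP_WORDS is a frozenset; convert the union to a list for stop_words.
combined_stop_words = sorted(ENGLISH_STOP_WORDS.union(french_extra))

vectorizer = TfidfVectorizer(stop_words=combined_stop_words)
vectorizer.fit(["the cat and le chat", "a dog et un chien"])

# Only the content words should remain in the learned vocabulary.
print(sorted(vectorizer.vocabulary_))
```

Passing a plain list rather than a set keeps the call compatible with versions of scikit-learn that validate the type of `stop_words`.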

Answer 2

Igor Sharm pointed out how to do this manually, but you could also install the stop-words package. Then, since TfidfVectorizer accepts a list as the stop_words parameter:

from stop_words import get_stop_words

my_stop_word_list = get_stop_words('english') + get_stop_words('french')

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words=my_stop_word_list,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1,3))

If you only want to include some of the words, you can also read and parse the french.txt file in the GitHub project as needed.
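Parsing such a file could look roughly like the sketch below. It assumes the file is UTF-8 text with one stop word per line, which should be checked against the actual french.txt. The helper name `load_stop_words` and its `keep` parameter are hypothetical, and a temporary file stands in for the downloaded one.

```python
import os
import tempfile


def load_stop_words(path, keep=None):
    """Read one stop word per line; optionally keep only the words listed in `keep`."""
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines and normalize case and surrounding whitespace.
        words = [line.strip().lower() for line in handle if line.strip()]
    if keep is not None:
        words = [word for word in words if word in keep]
    return words


# Demo with a temporary file standing in for a downloaded french.txt.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("le\nla\n\nles\nêtre\n")
    demo_path = tmp.name

all_words = load_stop_words(demo_path)
some_words = load_stop_words(demo_path, keep={"le", "les"})
os.remove(demo_path)

print(all_words)
print(some_words)
```

The resulting list can be passed directly as the stop_words argument, on its own or concatenated with an English list.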

Answer 3

You can use the stop word lists that ship with NLTK or Spacy, two very popular NLP libraries for Python. Since achultz has already added the snippet for the stop-words library, I will show how to do it with NLTK or Spacy.

NLTK:

from nltk.corpus import stopwords

final_stopwords_list = stopwords.words('english') + stopwords.words('french')
tfidf_vectorizer = TfidfVectorizer(max_df=0.8,
  max_features=200000,
  min_df=0.2,
  stop_words=final_stopwords_list,
  use_idf=True,
  tokenizer=tokenize_and_stem,
  ngram_range=(1,3))

NLTK will give you 334 stop words in total.

Spacy:

from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop

final_stopwords_list = list(fr_stop) + list(en_stop)
tfidf_vectorizer = TfidfVectorizer(max_df=0.8,
  max_features=200000,
  min_df=0.2,
  stop_words=final_stopwords_list,
  use_idf=True,
  tokenizer=tokenize_and_stem,
  ngram_range=(1,3))

Spacy will give you 890 stop words in total.
