
When is the stop word removal process executed in sklearn's TfidfVectorizer?

If I pass a custom stop word list to TfidfVectorizer, when exactly will the stop words be removed? According to the documentation:

stop_words : string {‘english’}, list, or None (default)

...

If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

So it seems that this process happens after tokenization, right? I ask because if tokenization also involves stemming, I think there is a risk of mistakenly keeping (rather than removing) a stop word: after stemming, it would no longer be recognized against the stop word list.
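A minimal sketch of the pitfall described above, using a toy suffix-stripping stemmer (a stand-in for a real stemmer such as NLTK's PorterStemmer; the helper names here are illustrative, not from sklearn):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def toy_stem(token):
    # Toy stemmer (stand-in for a real stemmer): strip a trailing "s".
    return token[:-1] if token.endswith("s") else token

def stemming_tokenizer(text):
    # Custom tokenizer that stems every token.
    return [toy_stem(t) for t in re.findall(r"\b\w\w+\b", text.lower())]

vec = TfidfVectorizer(tokenizer=stemming_tokenizer,
                      stop_words=["was"],     # unstemmed stop word
                      token_pattern=None)
analyzer = vec.build_analyzer()

# "was" is stemmed to "wa" before the stop-word check, so it no longer
# matches the stop list and survives:
print(analyzer("was fine"))  # ['wa', 'fine']
```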


You are right. stop_words are applied after the tokens are obtained and before they are turned into a sequence of n-grams; see feature_extraction/text.py. The tokenizer receives the text right after pre-processing, and stop words are not involved at that stage.
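This ordering can be observed directly: `build_analyzer()` returns the full preprocess → tokenize → stop-word removal → n-gram pipeline as one callable, and a removed stop word never appears in any n-gram (a small sketch, assuming default settings otherwise):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Custom stop word list; "the" should be dropped from the token stream
# before bigrams are built.
vec = TfidfVectorizer(stop_words=["the"], ngram_range=(1, 2))

analyzer = vec.build_analyzer()
print(analyzer("the quick fox"))  # ['quick', 'fox', 'quick fox']
```

Note that the bigram `'quick fox'` spans the position where `"the"` used to be, confirming that stop words are removed from the token list before the n-grams are assembled.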

The default tokenizer does not transform the text, but if you provide your own tokenizer that performs stemming or something similar, you should stem the stop words as well. Alternatively, you can do the filtering inside your tokenizer function.
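Both workarounds can be sketched as follows (again with a toy stemmer standing in for a real one; `toy_stem` and the tokenizer names are illustrative assumptions, not sklearn API):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def toy_stem(token):
    # Toy stemmer (stand-in for e.g. nltk.stem.PorterStemmer).
    return token[:-1] if token.endswith("s") else token

def stemming_tokenizer(text):
    return [toy_stem(t) for t in re.findall(r"\b\w\w+\b", text.lower())]

stop_words = ["this", "was"]

# Option 1: stem the stop word list so it matches the stemmed tokens.
vec1 = TfidfVectorizer(tokenizer=stemming_tokenizer,
                       stop_words=[toy_stem(w) for w in stop_words],
                       token_pattern=None)

# Option 2: filter the stop words inside the tokenizer itself,
# before stemming, so the original surface forms still match.
def filtering_tokenizer(text):
    return [toy_stem(t) for t in re.findall(r"\b\w\w+\b", text.lower())
            if t not in stop_words]

vec2 = TfidfVectorizer(tokenizer=filtering_tokenizer, token_pattern=None)

print(vec1.build_analyzer()("this was fine"))  # ['fine']
print(vec2.build_analyzer()("this was fine"))  # ['fine']
```

Option 2 has the advantage that the stop word list keeps its original, readable forms; Option 1 keeps all the logic in TfidfVectorizer's own stop-word step.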