Tf-idf vectorizer has whitespaces in feature words with char_wb?

I use

singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6),
                              stop_words=my_stop_words, max_features=50).fit([text])

and I wonder why there are whitespaces in my features, for example 'chaft '.

How can I avoid this? Do I need to tokenize and preprocess the text myself?

Use analyzer='word'.

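With char_wb the vectorizer does not emit word tokens. It builds character n-grams inside each word and pads the word with a space on both sides, so every n-gram that touches the start or end of a word contains whitespace.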
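To make the padding concrete, here is a minimal pure-Python sketch of word-boundary character n-grams. It is a simplified illustration, not scikit-learn's actual code: the function name `char_wb_ngrams` is made up for this example, and it skips words shorter than the n-gram size, which the library handles differently.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Illustrative sketch of character n-grams built inside word boundaries."""
    ngrams = []
    for word in text.lower().split():
        padded = " " + word + " "  # one space of padding on each side
        for n in range(min_n, max_n + 1):
            # slide a window of width n over the padded word
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


grams = char_wb_ngrams("first document", 4, 6)
print(grams)
```

The n-grams taken from the edges of the padded word, such as ' fir' and 'rst ', are the ones that carry a space.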

According to the documentation:

analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable

Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

See the following example for how it behaves:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range= (4,6))
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print([(len(w), w) for w in vectorizer.get_feature_names_out().tolist()])

Output:

[(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'), (4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'), (5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'), (4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'), (5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '), (4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'), (6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '), (4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'), (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '), (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '), (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '), (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'), (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'), (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'), (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'), (6, 'ument '), (6, 'ument.'), (6, 'ument?')]
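
If character n-grams are still wanted but without the padded ones, one option is to pass a callable as the analyzer, which the documentation quoted above allows. The sketch below is an assumption about how such a callable could look, not something from the original answer: `inner_char_ngrams` is a hypothetical helper, and using `\w+` to find words also drops punctuation.

```python
import re


def inner_char_ngrams(doc, min_n=4, max_n=6):
    """Hypothetical analyzer: character n-grams taken strictly inside words."""
    features = []
    # \w+ matches runs of letters, digits and underscores, so no spaces survive
    for word in re.findall(r"\w+", doc.lower()):
        for n in range(min_n, max_n + 1):
            for start in range(len(word) - n + 1):
                features.append(word[start:start + n])
    return features


features = inner_char_ngrams("This is the first document.")
print(features)
```

It could then be used as TfidfVectorizer(analyzer=inner_char_ngrams, max_features=50). With a callable analyzer the vectorizer hands over the raw document, so lowercasing and the n-gram range have to be handled inside the function, as done here.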