TF-IDF vectors can be generated at different levels of input tokens (words, characters, n-grams). Which one should we use?

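a. Word Level TF-IDF: Matrix representing tf-idf scores of every term in different documents.

b. N-gram Level TF-IDF: N-grams are the combination of N terms together. This matrix represents tf-idf scores of N-grams.

c. Character Level TF-IDF: Matrix representing tf-idf scores of character-level n-grams in the dataset.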

from sklearn.feature_extraction.text import TfidfVectorizer

# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(trainDF['texts'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)


# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2, 3), max_features=5000)
tfidf_vect_ngram.fit(trainDF['texts'])
xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x)
xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x)


# characters level tf-idf
# token_pattern is dropped here: it is only used when analyzer='word'
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2, 3), max_features=5000)
tfidf_vect_ngram_chars.fit(trainDF['texts'])
xtrain_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(train_x)
xvalid_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(valid_x)

There is no single right answer that fits every case. The approach will depend on the nature of your data.

You should use GridSearchCV to find the best approach for your exact case. The official scikit-learn documentation has a good example of a pipeline for text feature extraction.
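A minimal sketch of that search could look like the following. The toy texts, the labels, the `LogisticRegression` classifier and the specific grid values are all assumptions for illustration. Swap in your own data and model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration: 1 = positive review, 0 = negative review.
texts = [
    "great movie loved it", "really enjoyed this film", "wonderful acting and story",
    "a fantastic experience overall", "loved the great soundtrack", "enjoyed every single minute",
    "terrible movie hated it", "really boring slow film", "awful acting and story",
    "a dreadful experience overall", "hated the terrible soundtrack", "bored every single minute",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold, so no vocabulary or idf statistics leak from held-out folds.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Two sub-grids, so word-level and character-level settings are not mixed.
param_grid = [
    {"tfidf__analyzer": ["word"], "tfidf__ngram_range": [(1, 1), (1, 2), (2, 3)]},
    {"tfidf__analyzer": ["char", "char_wb"], "tfidf__ngram_range": [(2, 3), (2, 5)]},
]

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)  # the analyzer / ngram_range combination that scored best
print(search.best_score_)   # its mean cross-validated accuracy
```

On a real dataset you would typically also add `max_features` and the classifier's own hyperparameters to the grid, since the best token level often depends on them.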