How to classify new documents with tf-idf?

Question

If I generate feature vectors with sklearn's TfidfVectorizer like this:

features = TfidfVectorizer(min_df=0.2, ngram_range=(1,3)).fit_transform(myDocuments)

how would I then generate feature vectors to classify a new document, given that you cannot compute tf-idf for a single document?

Would it be correct to extract the feature names with:

feature_names = TfidfVectorizer.get_feature_names()

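and then count the term frequencies of the new document against feature_names?

But that way I would not get weights that carry the word-importance information.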

Answer 1

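You need to keep the fitted TfidfVectorizer instance: it remembers the vocabulary and the idf statistics learned from the documents it was fitted on. It may be clearer if you call fit and transform separately instead of fit_transform: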

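from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(min_df=0.2, ngram_range=(1,3))
vec.fit(myDocuments)  # learn the vocabulary and idf from the original documents
features = vec.transform(myDocuments)
new_features = vec.transform(myNewDocuments)  # apply the same vocabulary and idf to new documents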
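To connect this to the classification step, here is a minimal sketch. The toy documents, the labels, and the choice of MultinomialNB are assumptions made for illustration and are not part of the question. The point is only that the same fitted vectorizer is reused, so the new document is weighted with the idf values learned from the training set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data; documents and labels are made up for illustration.
train_docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell today",
    "investors sold shares and stocks",
]
train_labels = ["animals", "animals", "finance", "finance"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)  # learns the vocabulary and idf weights

# Any classifier that accepts sparse input would do; MultinomialNB is one choice.
clf = MultinomialNB()
clf.fit(X_train, train_labels)

new_docs = ["the cat chased the dog"]
X_new = vec.transform(new_docs)          # reuses the stored vocabulary and idf
predicted = clf.predict(X_new)
print(X_new.shape, predicted)
```

Words in the new document that were not seen during fit (here "chased" and "dog") are simply ignored by transform, so the new vector always has the same number of columns as the training matrix.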

Answer 2

I would rather use gensim with Latent Semantic Indexing as a wrapper over the original corpus: bow->tfidf->lsi

from gensim import models

# corpus is the bag-of-words corpus; dictionary is the matching gensim Dictionary
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

Then, if you need to continue training:

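# new_corpus is assumed to be the additional bag-of-words corpus (the original snippet reused the name corpus)
new_tfidf = models.TfidfModel(new_corpus)
new_corpus_tfidf = new_tfidf[new_corpus]
lsi.add_documents(new_corpus_tfidf) # now LSI has been trained on corpus_tfidf + new_corpus_tfidf
lsi_vec = lsi[tfidf_vec] # convert some new document (as a tf-idf vector) into the LSI space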

where corpus is a bag-of-words corpus.

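As you can read in their tutorials:
LSI training is unique in that we can continue "training" at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite: just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meantime!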

If you prefer scikit-learn, gensim is also compatible with numpy.
