Problem with input features for Latent Dirichlet Allocation

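I am trying to make predictions with my LDA model. When I pass it a string, it raises an error about mismatched input features. How can I make the model accept any input text and still predict the correct topic? Right now it expects exactly 54777 features as input.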

Model:

cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(npr['Article'])
LDA = LatentDirichletAllocation(n_components=7,random_state=42)
LDA.fit(dtm)

Prediction:

txt = ["The election of Donald Trump was a surprise to pollsters, pundits and, perhaps most of all, the Democratic Party."]
vectorizer = CountVectorizer()
txt_vectorized = vectorizer.fit_transform(txt)
predict = LDA.transform(txt_vectorized)
print(predict)

Error:

ValueError: X has 16 features, but LatentDirichletAllocation is expecting 54777 features as input.

There are three issues with this code snippet.

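  • Issue 1: max_df and min_df are given different types here (a float and an int). scikit-learn accepts this, reading a float as a proportion of documents and an int as an absolute document count, but using the same type for both makes the two thresholds easier to compare.
  • Issue 2: At prediction time you must use the same CountVectorizer instance that was fitted on the training data, so that the new text is mapped onto the same 54777-word vocabulary.
  • Issue 3: At prediction time you must call the CountVectorizer's transform method, not fit_transform. Calling fit_transform rebuilds the vocabulary from the new text alone, which is why the model sees only 16 features.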
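To see why issues 2 and 3 lead to the error, it may help to compare feature counts directly. The following is a minimal sketch on a made-up four-sentence corpus (the corpus and the variable names are illustrative, not taken from the question). It assumes scikit-learn is installed. A vectorizer fitted on the training text keeps its column count when transform is called on new text, while a freshly fitted vectorizer builds a new, smaller vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy training corpus (illustrative only; the question uses npr['Article']).
train_docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
new_docs = ["This is a new document"]

cv = CountVectorizer()
train_dtm = cv.fit_transform(train_docs)

# Reusing the fitted vectorizer: same number of columns as the training matrix.
# Words that were never seen during fitting (here "new") are simply ignored.
reused = cv.transform(new_docs)

# Fitting a fresh vectorizer on the new text rebuilds the vocabulary from scratch,
# so the column count no longer matches what the model was trained on.
refit = CountVectorizer().fit_transform(new_docs)

print("training columns:", train_dtm.shape[1])
print("transform columns:", reused.shape[1])
print("fit_transform columns:", refit.shape[1])
```

Only the matrix produced by the reused vectorizer has the shape an LDA model trained on train_dtm would accept.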

Here is some sample code that may help:

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
cv = CountVectorizer()

Train the model:

from sklearn.decomposition import LatentDirichletAllocation

dtm = cv.fit_transform(corpus)
LDA = LatentDirichletAllocation(n_components=7,random_state=42)
LDA.fit(dtm)

Prediction:

txt = ["This is a new document"]
txt_vectorized = cv.transform(txt)
predict = LDA.transform(txt_vectorized)
print(predict)
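One way to avoid mixing up vectorizers is to bundle the vectorizer and the model into a single scikit-learn Pipeline, so the fitted vocabulary always travels with the model. The sketch below is one possible arrangement, not the only one. The name topic_model and the choice of n_components=2 are assumptions made for this toy corpus. It also shows one way to read the output: each row of the result is a distribution over topics, and argmax picks the most likely one.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Hypothetical name; n_components=2 is chosen only to suit this tiny corpus.
topic_model = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=42),
)
topic_model.fit(corpus)

# Pipeline.transform calls the vectorizer's transform (not fit_transform)
# and then the LDA model's transform, so raw strings can be passed directly.
doc_topics = topic_model.transform(["This is a new document"])

# One row per input text, one column per topic.
print(doc_topics.shape)

# Index of the topic with the highest weight for the first (only) text.
top_topic = int(np.argmax(doc_topics, axis=1)[0])
print(top_topic)
```

With this setup the prediction step only ever sees raw text, so there is no separate vectorizer to refit by accident.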