Determine the correct number of topics using latent semantic analysis

Question

Starting from the following example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]

vectorizer = TfidfVectorizer(use_idf=False, norm='l1')
bag_of_words = vectorizer.fit_transform(body)

svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(bag_of_words)

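I would like to know whether there is a way (perhaps in scikit-learn) to choose the most appropriate number of topics.

In my specific case I chose 2 topics (arbitrarily), but I would like to know whether there is a way in Python to generalize this to larger cases (more documents and more words) and select the number of topics automatically.

Thanks for your help.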

Answer 1

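You can compute the explained variance over a range of candidate numbers of components. The maximum number of components is the size of your vocabulary.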

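import matplotlib.pyplot as plt

performance = []
# Candidate numbers of components: 1, 3, 5, ... below the vocabulary size
test = range(1, bag_of_words.shape[1], 2)

for n in test:
    svd = TruncatedSVD(n_components=n)
    lsa = svd.fit(bag_of_words)
    # Total fraction of variance explained by the n components
    performance.append(lsa.explained_variance_ratio_.sum())

fig = plt.figure(figsize=(15, 5))
plt.plot(test, performance, 'ro--')
plt.title('explained variance by n-components')
plt.show()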

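In the plot, the slope between consecutive points shows how much each added component contributes to the model's performance, and the point at which adding more components stops yielding additional information.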
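The slopes can also be read off numerically instead of from the plot. A minimal sketch, assuming made-up scores that stand in for the `test` and `performance` values produced by the loop (the numbers below are illustrative only, not results from the example corpus):

```python
import numpy as np

# Made-up values for illustration; in practice use `test` and `performance`
# as filled in by the loop over candidate component counts.
test = [1, 3, 5, 7]
performance = [0.20, 0.75, 0.95, 1.00]

# Slope between consecutive points: gain in explained variance
# per additional component.
slopes = np.diff(performance) / np.diff(test)

for start, stop, slope in zip(test[:-1], test[1:], slopes):
    print(f"{start} -> {stop} components: +{slope:.3f} per added component")
```

A slope close to zero means the extra components add almost nothing to the explained variance.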

To get the number of components beyond which no more information is added:

import numpy as np

test[np.array(performance).argmax()]

Output
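Since the summed explained variance usually keeps creeping upward as components are added, the argmax tends to land on one of the largest candidates. A common alternative heuristic, which is not part of the approach above, is to take the smallest number of components whose cumulative explained variance reaches a chosen cutoff. A rough sketch, assuming a hypothetical cutoff of 0.95 and a single fit with `min(n_documents, n_terms) - 1` components (both are my assumptions):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox',
]
bag_of_words = TfidfVectorizer(use_idf=False, norm='l1').fit_transform(body)

# Hypothetical cutoff: fraction of variance the kept components should explain.
threshold = 0.95

# Fit once with the largest component count considered here (an assumption).
max_components = min(bag_of_words.shape) - 1
svd = TruncatedSVD(n_components=max_components, random_state=0).fit(bag_of_words)

# Running total of explained variance as components are added.
cumulative = np.cumsum(svd.explained_variance_ratio_)

# First position where the running total reaches the cutoff,
# capped at max_components if the cutoff is never reached.
n_selected = min(int(np.searchsorted(cumulative, threshold)) + 1, max_components)
print(n_selected, cumulative)
```

The cutoff is a judgment call; lower values give fewer topics. Note that the default solver is randomized, so for large matrices the leading components of one big fit only approximate what a separate smaller fit would give.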

With the elbow method you can find the number of components just before the largest drop in added information:

test[np.abs(np.gradient(np.gradient(performance))).argmax()]

Output
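To see what the nested `np.gradient` calls are doing, it may help to look at the intermediate arrays. A small sketch on made-up scores (illustrative values only, not results from the example corpus):

```python
import numpy as np

# Made-up values for illustration only.
test = range(1, 9, 2)                    # candidates 1, 3, 5, 7
performance = [0.20, 0.75, 0.95, 1.00]

first = np.gradient(performance)   # approximate slope at each point
second = np.gradient(first)        # approximate change in slope (curvature)

# The elbow is taken as the point where the slope changes the most.
elbow_index = int(np.abs(second).argmax())
elbow_n = test[elbow_index]

print(first)
print(second)
print(elbow_n)
```

`np.gradient` assumes unit spacing between samples here, although the candidates are 2 apart. Because the spacing is uniform this only rescales the values and does not change which index the argmax picks.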

Tags: python, nlp, svd, topic-modeling, scikit-learn