How does sklearn Latent Dirichlet Allocation really work?
I have some texts and I am using sklearn's LatentDirichletAllocation algorithm to extract the topics from them.
I have already converted the texts to sequences using Keras, and I am doing this:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation()
X_topics = lda.fit_transform(X)
where X is:

print(X)
# array([[0, 988, 233, 21, 42, 5436, ...],
#        [0, 43, 6526, 21, 566, 762, 12, ...]])
and X_topics is:

print(X_topics)
# array([[1.24143852e-05, 1.23983890e-05, 1.24238815e-05, 2.08399432e-01,
#         7.91563331e-01],
#        [5.64976371e-01, 1.33304549e-05, 5.60003133e-03, 1.06638803e-01,
#         3.22771464e-01]])
My question is: what exactly does fit_transform return? I know it should be the dominant topics detected in the texts, but I cannot map these numbers back to an index, so I cannot tell what these sequences mean. I have not found an explanation of what is actually going on, so any advice would be greatly appreciated.
First, a general explanation: think of LDiA as a clustering algorithm that, by default, will determine 10 centroids based on the frequencies of words in the texts, and that gives some of those words greater weight than others by virtue of their proximity to a centroid. Each centroid represents a 'topic' in this context, where the topic is unnamed but can be sort of described by the words that are most dominant in forming each cluster.
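To make the original question concrete: fit_transform returns the document-topic matrix, with one row per document and one column per topic, and each row is (approximately) a probability distribution over the topics. A minimal sketch, using a toy corpus and illustrative variable names:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stocks fell on the market today"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)         # document-term count matrix (the input LDA expects)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)          # shape (n_docs, n_topics)
print(doc_topics.shape)                         # (3, 2)
print(doc_topics.sum(axis=1))                   # each row sums to roughly 1.0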
Generally, what you are doing with LDiA is:
- having it tell you what the 10 (or however many) topics of a given set of texts are,
or
- having it tell you which centroid/topic some new text is closest to.
For the second scenario, your expectation is that LDiA will output a 'score' for the new text against each of the 10 clusters/topics. The index of the highest score is the index of the cluster/topic the new text belongs to; a minimal sketch of extracting that index follows below.
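Continuing the toy sketch above for the second scenario (again, illustrative names only): score the new text with the already-fitted model and take the index of the highest score.

new_counts = vectorizer.transform(["my cat chased the dogs"])   # same vocabulary as the training matrix
scores = lda.transform(new_counts)                              # one score per topic, rows sum to ~1
best_topic = int(scores.argmax(axis=1)[0])                      # index of the closest cluster/topic
print(best_topic, scores)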
I prefer gensim.models.LdaMulticore, but since you used sklearn.decomposition.LatentDirichletAllocation, I'll go with that.
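(For comparison only, a rough gensim equivalent might look like the sketch below; it assumes already-tokenized texts, and the names are illustrative.)

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

tokenized = [t.lower().split() for t in ["the cat sat on the mat",
                                         "dogs and cats are pets",
                                         "stocks fell on the market today"]]
dictionary = Dictionary(tokenized)                               # word <-> id mapping
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]    # bag-of-words per document
lda_gensim = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2, passes=5, workers=2)
print(lda_gensim.print_topics())                                 # top words per topic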
Here is some example code that runs through this process (taken from here):
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
import random
n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

data, _ = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             return_X_y=True)
X = data[:n_samples]

# create a count vectorizer using the sklearn CountVectorizer, which has some useful features
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
vectorizedX = tf_vectorizer.fit_transform(X)

lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(vectorizedX)
Now let's try a new text:
testX = tf_vectorizer.transform(["I am educated about learned stuff"])
#get lda to score this text against each of the 10 topics
lda.transform(testX)
Out:
array([[0.54995409, 0.05001176, 0.05000163, 0.05000579, 0.05 ,
0.05001033, 0.05000001, 0.05001449, 0.05000123, 0.05000066]])
# looks like the first topic has the highest score - now, what words are most associated with each topic?
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out()  # get_feature_names() on older sklearn versions
print_top_words(lda, tf_feature_names, n_top_words)
Out:
Topics in LDA model:
Topic #0: edu com mail send graphics ftp pub available contact university list faq ca information cs 1993 program sun uk mit
Topic #1: don like just know think ve way use right good going make sure ll point got need really time doesn
Topic #2: christian think atheism faith pittsburgh new bible radio games alt lot just religion like book read play time subject believe
Topic #3: drive disk windows thanks use card drives hard version pc software file using scsi help does new dos controller 16
Topic #4: hiv health aids disease april medical care research 1993 light information study national service test led 10 page new drug
Topic #5: god people does just good don jesus say israel way life know true fact time law want believe make think
Topic #6: 55 10 11 18 15 team game 19 period play 23 12 13 flyers 20 25 22 17 24 16
Topic #7: car year just cars new engine like bike good oil insurance better tires 000 thing speed model brake driving performance
Topic #8: people said did just didn know time like went think children came come don took years say dead told started
Topic #9: key space law government public use encryption earth section security moon probe enforcement keys states lunar military crime surface technology
Looks reasonable enough - the example text is about education, and the word cloud for the first topic is about education.
The images below are from a different dataset (ham vs spam SMS messages, so there are only two possible topics), which I reduced to 3 dimensions with PCA, but if a picture helps, these two (the same data from different angles) may give a general sense of what is going on with LDiA. (The charts are from Linear Discriminant Analysis vs LDiA, but the representation is still relevant.)
Although LDiA is an unsupervised method, to actually use it in a business setting you will probably want to intervene manually, at least to give the topics names that are meaningful for your context, e.g. assigning a subject area to stories on a news-aggregation site, choosing among ['Business', 'Sports', 'Entertainment', ...].
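One low-tech way to do that, sketched here with made-up labels (hand-assigned by reading the top words above), is a plain dict from topic index to label, applied after scoring:

# hypothetical, hand-assigned labels for a few of the 10 topics above
topic_names = {0: 'Universities / academia', 6: 'Sports', 7: 'Autos'}
scores = lda.transform(testX)                     # reusing the fitted model and testX from the example above
best = int(scores.argmax(axis=1)[0])
print(topic_names.get(best, "Topic #%d" % best))  # fall back to the raw index if the topic is unlabeled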
For further study, perhaps run through something like this:
https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24