Computing the weight of LDA topics for all the documents in the corpus
I have trained my LDA model and retrieved my topics, and now I am looking for a way to compute the weight/percentage of each topic across the whole corpus. Surprisingly, I cannot find a way to do this. My code so far looks like this:
## Libraries to download
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
## Tokenizing
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = stopwords.words('english')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
import json
import nltk
import re
import pandas
appended_data = []
#for i in range(20014,2016):
# df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
# appended_data.append(df0)
for i in range(2005,2016):
    if i > 2013:
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
        appended_data.append(df0)
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    df2 = pandas.DataFrame([json.loads(l) for l in open('APJ_%d.json' % i)])
    df3 = pandas.DataFrame([json.loads(l) for l in open('TH500_%d.json' % i)])
    df4 = pandas.DataFrame([json.loads(l) for l in open('DRSM_%d.json' % i)])
    appended_data.append(df1)
    appended_data.append(df2)
    appended_data.append(df3)
    appended_data.append(df4)
appended_data = pandas.concat(appended_data)
# doc_set = df1.body
doc_set = appended_data.body
# list for tokenized documents in loop
texts = []
# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]
    # add tokens to list
    texts.append(stopped_tokens)
# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=15, id2word = dictionary, passes=50)
ldamodel.save("model.lda0")
What I have seen so far on other forums is to do the following:
from itertools import chain
print(type(doc_set))
print(len(doc_set))
for top in ldamodel.print_topics():
    print(top)
print()
# Assigning the topics to the documents in the corpus
lda_corpus = ldamodel[corpus]
#print(lda_corpus)
# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = list(chain(*[[score for topic_id,score in topic] \
for topic in [doc for doc in lda_corpus]]))
print(sum(scores))
print(len(scores))
threshold = sum(scores)/len(scores)
print(threshold)
cluster1 = [j for i,j in zip(lda_corpus,doc_set) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(lda_corpus,doc_set) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(lda_corpus,doc_set) if i[2][1] > threshold]
But for cluster two I get the error: IndexError: list index out of range. Any idea why?
By default gensim drops topics whose probability for a document falls below minimum_probability, so a document's topic list can contain fewer than three entries, which is what triggers the IndexError. You need to set the minimum probability to zero in the LdaModel call:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=15, id2word = dictionary, passes=50, minimum_probability=0)
In addition, you can get the topic distribution of every document like this:
for i in range(len(doc_set)):
    print(ldamodel[corpus[i]])
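To get at the original question, the overall weight/percentage of each topic in the corpus, one option (a minimal sketch of my own, using gensim's get_document_topics) is to sum each topic's probability over every document and normalise by the total:

# Minimal sketch (assumed approach, not from the original post): aggregate
# per-document topic probabilities into corpus-wide topic weights.
# With minimum_probability=0 every document returns a full distribution
# over all num_topics topics.
topic_totals = [0.0] * ldamodel.num_topics
for bow in corpus:
    for topic_id, prob in ldamodel.get_document_topics(bow, minimum_probability=0):
        topic_totals[topic_id] += prob
total = sum(topic_totals)
for topic_id, weight in enumerate(topic_totals):
    print(topic_id, weight / total)  # fraction of the corpus attributed to this topic

Since each document's topic probabilities already sum to one, total is essentially the number of documents, so weight / total is the average share of each topic across the corpus.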