How to use gensim topic modeling to predict a new document?
I am new to gensim topic modeling. Here is my sample code:
import nltk
nltk.download('stopwords')
import re
import pandas as pd  # needed for the DataFrames below
from pprint import pprint
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
# spacy for lemmatization
import spacy
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
#%matplotlib inline
# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
# Training documents and one unseen document
train = pd.DataFrame({'text': ['find the most representative document for each topic',
                               'topic distribution across documents',
                               'to help with understanding the topic',
                               'one of the practical application of topic modeling is to determine']})
text = pd.DataFrame({'text': ['how to find the optimal number of topics for topic modeling']})
data = train.loc[:, 'text'].values.tolist()
def sent_to_words(sentences):
    # Tokenize each sentence into lowercase words, stripping accents
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)
data_words = list(sent_to_words(data))
# Map words to ids, then convert each document to bag-of-words
id2word = corpora.Dictionary(data_words)
corpus = [id2word.doc2bow(text) for text in data_words]
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=3)
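So far so good. But I want to use lda_model to predict on new text. At a minimum, I need to know the topic distribution for the text and the relationship between all topics and words.
I think prediction is a very common and important feature of LDA, but I can't find where gensim provides it. Some answers say doc_lda = model[doc_bow] is the prediction (Calculating topic distribution of an unseen document on GenSim), but I am not sure.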
import pandas as pd
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
train = pd.DataFrame({'text': ['find the most representative document for each topic',
                               'topic distribution across documents',
                               'to help with understanding the topic',
                               'one of the practical application of topic modeling is to determine']})
text = pd.DataFrame({'text': ['how to find the optimal number of topics for topic modeling']})
def sent_to_words(sentences):
    for sentence in sentences:
        yield simple_preprocess(str(sentence), deacc=True)
# Use your train data to train the model with 4 topics
data_words = list(sent_to_words(train['text']))
id2word = corpora.Dictionary(data_words)
corpus = [id2word.doc2bow(doc) for doc in data_words]
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=4)
# Predict the topics of the new text held in the `text` DataFrame.
# Tokenize it the same way as the training data, then index the trained model with it.
new_text_corpus = id2word.doc2bow(simple_preprocess(text['text'][0], deacc=True))
lda_model[new_text_corpus]
# Output (exact values vary between runs because no random_state is set)
Out[75]:
[(0, 0.5517368), (1, 0.38150477), (2, 0.032756805), (3, 0.03400166)]
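This should help:
new_doc = ['topic', 'modeling']  # replace with your own preprocessed text, as a list of tokens
new_doc_bow = dictionary.doc2bow(new_doc)  # `dictionary` is the Dictionary used for training
ldamodel.get_document_topics(new_doc_bow)  # `ldamodel` is the trained LdaModel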