Gensim: find topics in sentences
I have trained an LDA model on a corpus. What I want is to get, for each sentence, its corresponding topic, so I can compare what the algorithm found against the labels I already have.
I tried the code below, but the results are bad: I get topic 17 far too often (maybe 25% of the volume, when it should be closer to 5%).
Thanks for your help.
# texts_lemmatized: list of documents, each a list of lemmatized tokens
dico = Dictionary(texts_lemmatized)
corpus_lda = [dico.doc2bow(text) for text in texts_lemmatized]
lda_ = LdaModel(corpus_lda, num_topics=18)
df_ = pd.DataFrame([])
data = []
# theme_commentaire = label of the string
for i in range(0, len(theme_commentaire)):
# lda_.get_document_topics() gives the topic distribution for a given sentence
algo = max(lda_.get_document_topics(corpus_lda[i]))[0]
human = theme_commentaire[i]
data.append([str(algo), human])
cols = ['algo', 'human']
df_ = pd.DataFrame(data, columns=cols)
df_.head()
Resolved in the comments:
I've found my problem, though: it's the max() function. It operates on the key value of my list of tuples [(num_topic, probability)], so basically I'll get 17 most of the time because it's the biggest key. – glouis
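In other words, plain `max()` compares the `(topic_id, probability)` tuples lexicographically, so it returns the tuple with the largest topic id rather than the most probable topic. Passing a `key` that selects the probability fixes it. A minimal sketch, where the `dist` list stands in for the output of `lda_.get_document_topics(corpus_lda[i])`:

```python
# Example output shape of lda_.get_document_topics(): (topic_id, probability) pairs.
dist = [(2, 0.10), (9, 0.75), (17, 0.15)]

# Plain max() compares tuples element by element, so it picks the
# largest topic id, not the most probable topic.
wrong = max(dist)[0]                       # 17

# Compare on the probability (second element) instead.
best = max(dist, key=lambda t: t[1])[0]    # 9
```

So in the question's loop, `algo = max(lda_.get_document_topics(corpus_lda[i]), key=lambda t: t[1])[0]` would assign each sentence its most probable topic.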