Evaluation of ldaseqmodel in gensim

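Is it possible to evaluate a dynamic model (ldaseqmodel) in terms of perplexity and topic coherence, the same way as a "normal" LDA model? I know these values are printed to logging.INFO, so an alternative would be to save logging.INFO to a text file and search it for the evaluation values after the run. If option 1 (code for evaluating the ldaseqmodel) does not exist, is it possible to save logging.INFO to a text file? This is my code for generating the ldaseqmodel: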

from gensim import models, corpora
import csv
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Anzahl_Topics1      = 10                

Zeitabschnitte      = [16, 19, 44, 51, 84, 122, 216, 290, 385, 441, 477, 375, 390, 408, 428, 192, 38]

TDM_dateipfad = './1gramm/TDM_1gramm_1998_2014.csv'

dateiname_corpus = "./1gramm/corpus_DTM_1gramm.mm"

dateiname1_dtm  = "./1gramm/DTM_1gramm_10.model"

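ids = {}     # term id -> term (first column of the CSV)
corpus = []  # one bag-of-words list per document (one per CSV column)

# The CSV is a term-document matrix: the header row names the documents,
# every following row holds a term and its count in each document.
with open(TDM_dateipfad, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quotechar='|')
    for rownumber, row in enumerate(reader):
        for index, field in enumerate(row):
            if index == 0:
                if rownumber > 0:
                    ids[rownumber-1] = field  # remember the term for this id
            else:
                if rownumber == 0:
                    corpus.append([])  # header row: create an empty document
                else:
                    # (term id, count) pair for document index-1
                    corpus[index-1].append((rownumber-1, int(field)))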

corpora.MmCorpus.serialize(dateiname_corpus, corpus)

dtm1 = models.ldaseqmodel.LdaSeqModel(corpus=corpus, time_slice=Zeitabschnitte, id2word=ids, num_topics=Anzahl_Topics1, passes=1, chunksize=10000)
dtm1.save(dateiname1_dtm)

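You are asking two quite different questions.

Is it possible to save logging.INFO to a text file?

Yes. You can use this code to send the logs to a file instead of the console. DEBUG-level logging gives you more detail than INFO.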

import logging
# The keyword is `filename`; basicConfig rejects an unknown `file` argument.
logging.basicConfig(level=logging.DEBUG, filename='yourlogname.log')

You may also want to set up handlers so that INFO is logged to the console while DEBUG-level output goes to the file. See the Python logging documentation for details.
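A minimal sketch of that two-handler setup, using only the standard library. The helper name `configure_logging` is made up for illustration, and the format string is simply the one from the question:

```python
import logging


def configure_logging(log_path="yourlogname.log"):
    """Send DEBUG and above to a file, INFO and above to the console.

    `configure_logging` is a hypothetical helper name, not part of gensim.
    """
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)  # let everything through; handlers filter

    formatter = logging.Formatter("%(asctime)s : %(levelname)s : %(message)s")

    # File handler: keeps the detailed DEBUG output for later searching.
    file_handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(formatter)

    # Console handler: only INFO and above, to keep the terminal readable.
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    console_handler.setFormatter(formatter)

    root.addHandler(file_handler)
    root.addHandler(console_handler)
    return file_handler, console_handler
```

Call `configure_logging()` once before training the model, in place of the `logging.basicConfig(...)` line, so that the handlers are not added twice.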

Is it possible to evaluate a DTM using perplexity and topic coherence?

Yes, use dtm_coherence (see the gensim documentation). Coherence is generally a more useful measure than perplexity in terms of "do humans understand this". You will have to compute it for each time slice separately, though. If you want to compare two models, say a 10-topic vs. a 20-topic model, my recommendation would be to loop over the time slices for each model and graph the coherence scores to see whether one is consistently better. There is a nice tutorial in the DTM example from the gensim devs.
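A rough sketch of that loop, under some assumptions: `coherence_per_time_slice` is a name invented here, `model` is a trained LdaSeqModel, and `dictionary` is a gensim `Dictionary` rather than the plain `ids` dict from the question (as far as I know, `Dictionary.from_corpus(corpus, id2word=ids)` can build one). `u_mass` is chosen because it only needs the bag-of-words corpus, not the original texts.

```python
def coherence_per_time_slice(model, corpus, dictionary, num_time_slices,
                             coherence="u_mass"):
    """Return one coherence score per time slice of a dynamic topic model.

    Hypothetical helper: `model` is assumed to be a trained LdaSeqModel and
    `dictionary` a gensim Dictionary covering the same vocabulary.
    """
    # Imported here so the helper itself can be defined without gensim loaded.
    from gensim.models.coherencemodel import CoherenceModel

    scores = []
    for t in range(num_time_slices):
        # dtm_coherence(time) returns the top words of every topic at slice t
        topics = model.dtm_coherence(time=t)
        cm = CoherenceModel(topics=topics, corpus=corpus,
                            dictionary=dictionary, coherence=coherence)
        scores.append(cm.get_coherence())
    return scores
```

With the names from the question this would be called roughly as `coherence_per_time_slice(dtm1, corpus, dictionary, len(Zeitabschnitte))`. Running it once per model (e.g. the 10-topic and the 20-topic one) gives two lists that can be plotted against the time slice index.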