Evaluating the model as you train with scikit's LatentDirichletAllocation class

Question

I am experimenting with the LatentDirichletAllocation() class in scikit-learn, and the evaluate_every parameter has the following description.

How often to evaluate perplexity. Only used in fit method. Set it to 0 or negative number to not evaluate perplexity in training at all. Evaluating perplexity can help you check convergence in training process, but it will also increase total training time. Evaluating perplexity in every iteration might increase training time up to two-fold.

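I set this parameter to 2 (the default is 0) and saw the training time increase, but I can't find the perplexity values anywhere. Are these results saved, or are they only used by the model to decide when to stop? I was hoping to use the perplexity values to measure my model's progress and learning curve.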

Answer 1

According to the source, the perplexity is used together with the perp_tol parameter to assess convergence, and it is not saved between iterations:

for i in xrange(max_iter):

    # ...

    # check perplexity
    if evaluate_every > 0 and (i + 1) % evaluate_every == 0:
        doc_topics_distr, _ = self._e_step(X, cal_sstats=False,
                                            random_init=False,
                                            parallel=parallel)
        bound = self.perplexity(X, doc_topics_distr,
                                sub_sampling=False)
        if self.verbose:
            print('iteration: %d, perplexity: %.4f'
                    % (i + 1, bound))

        if last_bound and abs(last_bound - bound) < self.perp_tol:
            break
        last_bound = bound
    self.n_iter_ += 1
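To make the stopping rule easier to follow, here is a minimal standalone sketch that replays the same check on a list of precomputed perplexity values. The function name run_with_perplexity_checks and the sample numbers are hypothetical and only for illustration; they are not part of scikit-learn. Unlike the excerpt, the sketch keeps every evaluated value in a list and tests last_bound against None rather than relying on truthiness.

```python
def run_with_perplexity_checks(perplexity_per_iter, evaluate_every, perp_tol):
    """Replay the evaluate_every / perp_tol stopping rule (hypothetical helper).

    perplexity_per_iter holds one made-up perplexity value per iteration.
    Returns the number of completed iterations and the evaluated values.
    """
    history = []       # (iteration, perplexity) pairs that were actually evaluated
    last_bound = None
    n_iter = 0
    for i, bound in enumerate(perplexity_per_iter):
        # Perplexity is only looked at on every evaluate_every-th iteration.
        if evaluate_every > 0 and (i + 1) % evaluate_every == 0:
            history.append((i + 1, bound))
            # Stop once two consecutive evaluations differ by less than perp_tol.
            if last_bound is not None and abs(last_bound - bound) < perp_tol:
                break
            last_bound = bound
        n_iter += 1
    return n_iter, history


# Made-up values, evaluated every 2nd iteration with a tolerance of 0.1.
values = [500.0, 400.0, 350.0, 349.0, 348.95, 348.94, 348.93, 348.93]
n_iter, history = run_with_perplexity_checks(values, evaluate_every=2, perp_tol=0.1)
print(n_iter, history)
```

With evaluate_every=2 only iterations 2, 4 and 6 are compared, so the loop stops at the first evaluation where the change from the previous evaluated value drops below perp_tol.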

Note that you can easily adapt the existing source to save these values by (1) adding the line self.bounds = [] to the __init__ method and (2) adding self.bounds.append(bound) to the method above, like this:

if last_bound and abs(last_bound - bound) < self.perp_tol:
    break
last_bound = bound
self.bounds.append(bound)

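Depending on where you save the updated class, you will also have to adjust the imports at the top of the file to reference the full module paths in scikit-learn.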
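If you would rather not edit the library source, one possible alternative is to drive training yourself through the public partial_fit and perplexity methods and record the value after each pass. This is only a sketch under assumptions: the toy count matrix, the number of topics and the number of passes are made up, and partial_fit performs online updates, so the resulting curve is not the same as the one the batch loop inside fit would produce.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix (made-up data, for illustration only).
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(40, 30))

lda = LatentDirichletAllocation(
    n_components=3,            # assumed number of topics
    total_samples=X.shape[0],  # corpus size used to scale the online updates
    random_state=0,
)

perplexities = []
for n_pass in range(5):
    lda.partial_fit(X)                      # one online pass over the data
    perplexities.append(lda.perplexity(X))  # approximate perplexity on X

for n_pass, value in enumerate(perplexities, start=1):
    print("pass: %d, perplexity: %.4f" % (n_pass, value))
```

The perplexities list can then be plotted as a learning curve. Computing perplexity after every pass adds an extra E-step each time, so it carries the same kind of training-time cost that the evaluate_every description warns about.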

machine-learning

lda

unsupervised-learning

scikit-learn