TensorFlow word embedding model + LDA: Negative values in data passed to LatentDirichletAllocation.fit

Question

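I am trying to use a pre-trained model from TensorFlow Hub to produce word embeddings, instead of a frequency-based vectorization technique, and then pass the resulting feature vectors to an LDA model.

I followed the steps for the TensorFlow model, but passing the resulting feature vectors to the LDA model raises this error: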

Negative values in data passed to LatentDirichletAllocation.fit

Here is what I have implemented so far:

import pandas as pd
import matplotlib.pyplot as plt
import tensorflow_hub as hub

from sklearn.decomposition import LatentDirichletAllocation

embed = hub.load("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50-with-normalization/1")
embeddings = embed(["cat is on the mat", "dog is in the fog"])
lda_model = LatentDirichletAllocation(n_components=2, max_iter=50)
lda = lda_model.fit_transform(embeddings)

I noticed that print(embeddings) shows some negative values, as seen below:

tf.Tensor(
[[ 0.16589954  0.0254965   0.1574857   0.17688066  0.02911299 -0.03092718
   0.19445257 -0.05709129 -0.08631689 -0.04391516  0.13032274  0.10905275
  -0.08515751  0.01056632 -0.17220995 -0.17925954  0.19556305  0.0802278
  -0.03247919 -0.49176937 -0.07767699 -0.03160921 -0.13952136  0.05959712
   0.06858718  0.22386682 -0.16653948  0.19412343 -0.05491862  0.10997339
  -0.15811177 -0.02576607 -0.07910853 -0.258499   -0.04206644 -0.20052543
   0.1705603  -0.15314153  0.0039225  -0.28694248  0.02468278  0.11069503
   0.03733957  0.01433943 -0.11048374  0.11931834 -0.11552787 -0.11110869
   0.02384969 -0.07074881]

Is there a workaround for this?

Answer 1

The fit function of LatentDirichletAllocation does not accept arrays that contain negative values, so I suggest applying softplus to the embeddings.

Here is the code snippet:

import tensorflow as tf
import tensorflow_hub as hub

from sklearn.decomposition import LatentDirichletAllocation

embed = hub.load("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50-with-normalization/1")
# softplus(x) = log(1 + exp(x)) maps each embedding value to a positive number
embeddings = tf.math.softplus(embed(["cat is on the mat", "dog is in the fog"]))

lda_model = LatentDirichletAllocation(n_components=2, max_iter=50)
# convert the tensor to a NumPy array before handing it to scikit-learn
lda = lda_model.fit_transform(embeddings.numpy())
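
To see why softplus removes the negative entries, here is a minimal sketch in plain Python, with no TensorFlow required. The helper name softplus and the list sample are illustrative only; sample holds the first six values of the embedding printed in the question. The formula is the standard softplus, log(1 + exp(x)), written in a form that avoids overflow.

```python
import math


def softplus(x: float) -> float:
    """Softplus, log(1 + exp(x)), in a numerically stable form."""
    # max(x, 0) + log1p(exp(-|x|)) equals log(1 + exp(x)) but never
    # exponentiates a large positive number.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# First six values of the embedding printed in the question (illustrative sample).
sample = [0.16589954, 0.0254965, 0.1574857, 0.17688066, 0.02911299, -0.03092718]

transformed = [softplus(v) for v in sample]

# The negative input is mapped to a positive output, like all the others.
print(min(sample) < 0, all(t > 0 for t in transformed))

# softplus is increasing, so the relative order of the values is unchanged.
print(transformed)
```

This only shows that the transformed values pass the non-negativity check. LDA in scikit-learn is normally fit on word-count matrices, so it may be worth checking whether the topics obtained from transformed embeddings are meaningful for your data.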

lda

topic-modeling

scikit-learn

tensorflow

word-embedding