
Is there a way to infer topic distributions for an unseen document from a pre-trained gensim LDA model using matrix multiplication?

Is there a way to use a pre-trained LDA model to get the topic distribution of an unseen document without using the LDA_Model[unseenDoc] syntax? I am trying to integrate my LDA model into a web app, and if there were a way to get equivalent results using matrix multiplication, I could use the model in JavaScript.

For example, I tried the following:

import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')


stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def StemmAndLemmatize(token):
    #Assumed helper (not shown in the original post): lemmatize as a verb, then stem
    return stemmer.stem(lemmatizer.lemmatize(token, pos='v'))

def Preprocesser(text):

    smallestWordSize = 3
    processedList = []

    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))

    return processedList

lda_model = models.LdaModel.load(r'LDAModel\GoldModel')  #Load pretrained LDA model
dictionary = Dictionary.load(r"ModelTrain\ManDict")      #Load dictionary the model was trained on

#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"

termTopicMatrix = lda_model.get_topics()    #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc)                #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc)       #Create bow using dictionary
dictSize = len(termTopicMatrix[0])          #Number of terms in the dictionary
fullDict = np.zeros(dictSize)               #Initialize a dense array of dictionary length
First = [first[0] for first in bowDoc]      #Term indices from the bag of words
Second = [second[1] for second in bowDoc]   #Term frequencies from the bag of words
fullDict[First] = Second                    #Place term frequencies at their dictionary indices


print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])

Output:
Matrix Multiplication: 
 [0.0283254  0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
 0.01558603 0.0370233  0.04648389 0.02887623 0.00776652 0.02147539
 0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
 0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
 0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
 0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax: 
 [(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]

There are 35 topics and 1155 words in the pre-trained model.

在"Conventional Syntax"输出中,每个元组的第一个元素是主题的索引,第二个元素是主题的概率。在"Matrix Multiplication"版本中,probability是index,value是probability。显然两者不匹配。

For example, lda_model[unseenDoc] reports a probability of 0.07 for topic 0, but the matrix multiplication gives that topic a probability of 0.028. Am I missing a step?

You can view the full source code used by the LdaModel.get_document_topics() method in your installation, or online at:

https://github.com/RaRe-Technologies/gensim/blob/e75f6c8e8d1dee0786b1b2cd5ef60da2e290f489/gensim/models/ldamodel.py#L1283

(It also makes use of the inference() method in the same file.)

It does more scaling/normalization/clipping than your code does, which is likely the cause of the discrepancy. But you should be able to review your process against it, line by line, to see where the steps differ and make them match.
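For orientation, here's a minimal single-document sketch of that inference step, assuming the model exposes alpha and expElogbeta the way current gensim versions do (the function name infer_topics is mine; treat this as a reading aid for inference(), not a drop-in replacement):

import numpy as np
from gensim.matutils import dirichlet_expectation

def infer_topics(bow, alpha, expElogbeta, passes=50, threshold=0.001):
    #Single-document variational E-step, mirroring LdaModel.inference()
    ids = [idx for idx, _ in bow]                        #Term ids present in the doc
    cts = np.array([cnt for _, cnt in bow], dtype=float) #Their counts
    gamma = np.random.gamma(100., 1. / 100., len(alpha)) #Random init, as gensim does
    expElogtheta = np.exp(dirichlet_expectation(gamma))
    expElogbetad = expElogbeta[:, ids]                   #Restrict to the doc's terms
    for _ in range(passes):
        lastgamma = gamma
        phinorm = np.dot(expElogtheta, expElogbetad) + 1e-100  #Per-word normalizer
        gamma = alpha + expElogtheta * np.dot(cts / phinorm, expElogbetad.T)
        expElogtheta = np.exp(dirichlet_expectation(gamma))
        if np.mean(np.abs(gamma - lastgamma)) < threshold:     #Converged?
            break
    return gamma / gamma.sum()  #get_document_topics() also drops tiny values

#e.g. infer_topics(bowDoc, lda_model.alpha, lda_model.expElogbeta)

Note that the update is iterative and works with exp(E[log beta]) rather than the raw get_topics() matrix, which is why a single dot product can't reproduce the result.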

It also shouldn't be hard to use the steps in the gensim code as a guide for creating parallel JavaScript code that, given the right parts of the model state, reproduces its results.
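As a starting point, here's a sketch of exporting just those parts of the model state to JSON for a JavaScript port to consume (the filename and field names are illustrative):

import json

state = {
    'alpha': lda_model.alpha.tolist(),              #Document-topic prior
    'expElogbeta': lda_model.expElogbeta.tolist(),  #exp(E[log beta]) used by inference()
    'token2id': dict(dictionary.token2id),          #Token -> term-id mapping
}
with open('lda_state.json', 'w') as f:              #Illustrative filename
    json.dump(state, f)

The JavaScript side would then tokenize the same way, build the counts vector, and run the same iterative update as above.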