Spark 2.1.1: How to predict topics in unseen documents with an already trained LDA model?
I am training an LDA model in pyspark (Spark 2.1.1) on a customer-review dataset. Now, based on that model, I want to predict the topics of new, unseen text.
I am building the model with the following code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext, Row
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, StopWordsRemover
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.ml.clustering import DistributedLDAModel, LocalLDAModel
from pyspark.mllib.linalg import Vector, Vectors
from pyspark.sql.functions import *
import pyspark.sql.functions as F
path = "D:/sparkdata/sample_text_LDA.txt"
sc = SparkContext("local[*]", "review")
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv("D:/sparkdata/customers_data.csv", header=True, inferSchema=True)
data = df.select("Reviews").rdd.map(list).map(lambda x: x[0]).zipWithIndex().map(lambda words: Row(idd= words[1], words = words[0].split(" "))).collect()
docDF = spark.createDataFrame(data)
remover = StopWordsRemover(inputCol="words", outputCol="stopWordsRemoved")
stopWordsRemoved_df = remover.transform(docDF).cache()
Vector = CountVectorizer(inputCol="stopWordsRemoved", outputCol="vectors")
model = Vector.fit(stopWordsRemoved_df)
result = model.transform(stopWordsRemoved_df)
corpus = result.select("idd", "vectors").rdd.map(lambda x: [x[0],Vectors.fromML(x[1])]).cache()
# Cluster the documents topics using LDA
ldaModel = LDA.train(corpus, k=3,maxIterations=100,optimizer='online')
topics = ldaModel.topicsMatrix()
vocabArray = model.vocabulary
print(ldaModel.describeTopics())
wordNumbers = 10 # number of words per topic
topicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic = wordNumbers))
def topic_render(topic):  # map the term indices of each topic to the actual words
    terms = topic[0]
    result = []
    for i in range(wordNumbers):
        term = vocabArray[terms[i]]
        result.append(term)
    return result

topics_final = topicIndices.map(lambda topic: topic_render(topic)).collect()

for topic in range(len(topics_final)):
    print("Topic" + str(topic) + ":")
    for term in topics_final[topic]:
        print(term)
    print('\n')
Now I have a dataframe with a column containing new customer reviews, and I want to predict which topic cluster each of them belongs to.
I have searched for an answer, and the mostly recommended approach is the following, as here.
newDocuments: RDD[(Long, Vector)] = ...
topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)
However, I get the following error:
'LDAModel' object has no attribute 'toLocal'.
It does not have a topicDistribution attribute either.
Are these attributes not supported in Spark 2.1.1?
Is there any other way to infer topics from unseen data?
You will need to preprocess the new data:
# import a new data set to be passed through the pre-trained LDA
import pandas as pd
import gensim

data_new = pd.read_csv('YourNew.csv', encoding="ISO-8859-1")
data_new = data_new.dropna()
data_text_new = data_new[['Your Target Column']]
data_text_new['index'] = data_text_new.index
documents_new = data_text_new
#documents_new = documents.dropna(subset=['Preprocessed Document'])
# run the new data set through the same lemmatization and stop-word functions
# used at training time ('preprocess' is that helper; a sketch is given below)
processed_docs_new = documents_new['Your Target Column'].map(preprocess)
# create a dictionary of individual words and filter it
# (the word ids have to line up with the ones ldamodel was trained on,
#  so re-use the training dictionary here if you still have it)
dictionary_new = gensim.corpora.Dictionary(processed_docs_new[:])
dictionary_new.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# define the bow_corpus
bow_corpus_new = [dictionary_new.doc2bow(doc) for doc in processed_docs_new]
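The preprocess helper above is not defined in this snippet; it is assumed to be the same lemmatization/stop-word function that was used at training time. A minimal sketch of such a function, assuming gensim's tokenizer and stop-word list plus NLTK's WordNetLemmatizer, might look like this:
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer  # requires the NLTK 'wordnet' corpus

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # lowercase and tokenize, drop stop words and very short tokens, then lemmatize
    return [lemmatizer.lemmatize(token)
            for token in simple_preprocess(text)
            if token not in STOPWORDS and len(token) > 3]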
Then you can run it through the trained LDA simply by indexing the model with the new corpus. All you need is the bow_corpus:
ldamodel[bow_corpus_new[:len(bow_corpus_new)]]
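For a single document, the same indexing returns its topic mixture as a list of (topic_id, probability) pairs, for example:
# inspect the topic mixture of the first new document, most probable topic first
doc_topics = sorted(ldamodel[bow_corpus_new[0]], key=lambda x: -x[1])
print(doc_topics)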
If you want the output in a csv, try this:
a = ldamodel[bow_corpus_new[:len(bow_corpus_new)]]
b = data_text_new

topic_0 = []
topic_1 = []
topic_2 = []
# note: this assumes every document comes back with all three topics;
# if the model was trained with a non-zero minimum_probability, some
# topics may be dropped and the positional indexing below will shift
for i in a:
    topic_0.append(i[0][1])
    topic_1.append(i[1][1])
    topic_2.append(i[2][1])

d = {'Your Target Column': b['Your Target Column'].tolist(),
     'topic_0': topic_0,
     'topic_1': topic_1,
     'topic_2': topic_2}

df = pd.DataFrame(data=d)
df.to_csv("YourAllocated.csv", index=True, mode='a')
Hope this helps :)
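If you would rather stay inside Spark instead of switching to gensim: the mllib Python wrapper in 2.1.1 does not expose toLocal or topicDistributions, but the DataFrame-based pyspark.ml.clustering.LDA can score unseen documents with transform(). The following is only a minimal sketch, re-using the fitted remover, CountVectorizer model and result DataFrame from the question; new_docDF is a hypothetical DataFrame of the new reviews with the same idd and words columns as docDF:
from pyspark.ml.clustering import LDA as MLLDA

# train with the DataFrame-based ml API instead of mllib;
# "vectors" is the CountVectorizer output column from the question's pipeline
ml_lda = MLLDA(k=3, maxIter=100, optimizer="online", featuresCol="vectors")
ml_lda_model = ml_lda.fit(result)

# new reviews must go through the *same* fitted StopWordsRemover and
# CountVectorizerModel before scoring (new_docDF is assumed to have a "words" column)
new_vectors = model.transform(remover.transform(new_docDF))

# transform() adds a "topicDistribution" column with one probability per topic
predictions = ml_lda_model.transform(new_vectors)
predictions.select("idd", "topicDistribution").show(truncate=False)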