Any way to access methods from individual stages in PySpark PipelineModel?
I have created a PipelineModel for doing LDA in Spark 2.0 (via the PySpark API):
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern=r'[\W]+'):
    """
    Create a pipeline for running an LDA model on a corpus. This function does not need data and will not actually do
    any fitting until invoked by the caller.
    Args:
        minTokenLength: minimum token length kept by the tokenizer
        minDF: minimum number of documents word is present in corpus
        minTF: minimum number of times word is found in a document
        numTopics: number of LDA topics to fit
        seed: random seed
        pattern: regular expression to split words
    Returns:
        pipeline: class pyspark.ml.Pipeline (yields a PipelineModel once fitted)
    """
    reTokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern=pattern, minTokenLength=minTokenLength)
    cntVec = CountVectorizer(inputCol=reTokenizer.getOutputCol(), outputCol="vectors", minDF=minDF, minTF=minTF)
    lda = LDA(k=numTopics, seed=seed, optimizer="em", featuresCol=cntVec.getOutputCol())
    pipeline = Pipeline(stages=[reTokenizer, cntVec, lda])
    return pipeline
I would like to calculate the perplexity of a dataset with the trained model using the LDAModel.logPerplexity() method, so I tried running the following:
from pprint import pprint

training = get_20_newsgroups_data(test_or_train='test')
pipeline = create_lda_pipeline(numTopics=20, minDF=3, minTokenLength=5)
model = pipeline.fit(training)  # train model on training data
testing = get_20_newsgroups_data(test_or_train='test')
perplexity = model.logPerplexity(testing)
pprint(perplexity)
This just results in the following AttributeError:
'PipelineModel' object has no attribute 'logPerplexity'
I understand why this error occurs, since the logPerplexity method belongs to LDAModel rather than PipelineModel, but I would like to know whether there is a way to access that method from the corresponding stage.
All the transformers in a fitted pipeline are stored in the stages property. Extract the stages, take the last one, and you are good to go:
model.stages[-1].logPerplexity(testing)
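As a minimal sketch (assuming the pipeline above, with testing still being the raw-text DataFrame): logPerplexity expects the featuresCol ("vectors") produced by the earlier stages, so the test data may first need to be pushed through the fitted pipeline. Other LDAModel methods are reachable the same way:

lda_model = model.stages[-1]                # fitted LDA stage (DistributedLDAModel when optimizer="em")
transformed = model.transform(testing)      # runs all fitted stages; the output keeps the "vectors" column
perplexity = lda_model.logPerplexity(transformed)
topics = lda_model.describeTopics(maxTermsPerTopic=10)  # per-topic term indices and weights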
I ran into a case where pipeline.stages did not work: pipeline.stages was treated as a Param. That happens on an unfitted Pipeline. In that case, use pipeline.getStages() and you will get the list of stages, just as the .stages attribute does on a fitted PipelineModel.
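To make the distinction concrete, here is a rough sketch assuming the create_lda_pipeline function and the training DataFrame from above:

pipeline = create_lda_pipeline()     # pyspark.ml.Pipeline: .stages here is a Param, so use getStages()
print(pipeline.getStages())          # list with the RegexTokenizer, CountVectorizer and LDA stages
model = pipeline.fit(training)       # pyspark.ml.PipelineModel: .stages is a plain Python list
print(model.stages)                  # fitted transformers; the last one is the LDA model
print(model.stages[-1])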