How to extract the vocabulary from a Pipeline
I can extract the vocabulary from a CountVectorizerModel in the following way:
from pyspark.ml.feature import CountVectorizer, StopWordsRemover
fl = StopWordsRemover(inputCol="words", outputCol="filtered")
df = fl.transform(df)
cv = CountVectorizer(inputCol="filtered", outputCol="rawFeatures")
model = cv.fit(df)
print(model.vocabulary)
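The code above prints the vocabulary as a list, where each term's position is its index.
Now I have built a pipeline from the code above, as follows: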
from pyspark.ml import Pipeline
rm_stop_words = StopWordsRemover(inputCol="words", outputCol="filtered")
count_freq = CountVectorizer(inputCol=rm_stop_words.getOutputCol(), outputCol="rawFeatures")
pipeline = Pipeline(stages=[rm_stop_words, count_freq])
model = pipeline.fit(dfm)
df = model.transform(dfm)
print(model.vocabulary) # This won't work as it's not CountVectorizerModel
This raises the following error:
print(len(model.vocabulary))
AttributeError: 'PipelineModel' object has no attribute 'vocabulary'
So how do I extract a model attribute from a pipeline?
As with any other stage attribute, extract the stages:
stages = model.stages
Find the one(s) you're interested in:
from pyspark.ml.feature import CountVectorizerModel
vectorizers = [s for s in stages if isinstance(s, CountVectorizerModel)]
And get the desired field:
[v.vocabulary for v in vectorizers]
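The three steps above can be folded into one small helper. This is only a minimal sketch: `stages_of_type` is a hypothetical name, not part of PySpark, and the `Fake...` classes are stand-ins so the snippet runs without a Spark session. With a real pipeline you would pass the fitted `PipelineModel` and `CountVectorizerModel` instead.

```python
from types import SimpleNamespace


def stages_of_type(pipeline_model, stage_type):
    """Return every stage of pipeline_model that is an instance of stage_type."""
    return [s for s in pipeline_model.stages if isinstance(s, stage_type)]


# Stand-ins (assumptions for illustration only) that mimic the two fitted stages.
class FakeStopWordsRemover:
    pass


class FakeCountVectorizerModel:
    def __init__(self, vocabulary):
        self.vocabulary = vocabulary


# Mimics a fitted PipelineModel, which exposes its stages as a list.
fake_model = SimpleNamespace(
    stages=[FakeStopWordsRemover(), FakeCountVectorizerModel(["spark", "pipeline"])]
)

vocabularies = [
    v.vocabulary for v in stages_of_type(fake_model, FakeCountVectorizerModel)
]
print(vocabularies)
```

Since `stages` keeps the order in which the pipeline was defined, and the vectorizer is the last stage in the pipeline from the question, `model.stages[-1].vocabulary` should work there as well. Filtering by type is the safer choice if the pipeline layout may change.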