pyspark.sql.utils.IllegalArgumentException: '字段"features"不存在

pyspark.sql.utils.IllegalArgumentException: 'Field "features" does not exist

我正在尝试通过 SparkNLP 对文本数据执行主题建模和情感分析。我已经完成了数据集的所有预处理步骤,但在 LDA 中出现错误。

Error

计划是:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF
from pyspark.ml.clustering import LDA
from pyspark.sql.functions import col, lit, concat, regexp_replace
from pyspark.sql.utils import AnalysisException
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors

dataframe_new = spark.read.format('com.databricks.spark.csv') \
.options(header='true', inferschema='true') \
.load('/home/cdh@psnet.com/Gourav/chap3/abcnews-date-text.csv')

get_tokenizers = Tokenizer(inputCol="headline_text", outputCol="get_tokens")
get_tokenized = get_tokenizers.transform(dataframe_new)

remover = StopWordsRemover(inputCol="get_tokens", outputCol="row")
get_remover = remover.transform(get_tokenized)

counter_vectorized = CountVectorizer(inputCol="row", outputCol="get_features")
getmodel = counter_vectorized.fit(get_remover)
get_result = getmodel.transform(get_remover)

idf_function = IDF(inputCol="get_features", outputCol="get_idf_feature")
train_model = idf_function.fit(get_result)
outcome = train_model.transform(get_result)

lda = LDA(k=10, maxIter=10)
model = lda.fit(outcome)

IDF 之后的 DataFrame 架构:

根据documentation,LDA包含一个featuresCol参数,默认值为featuresCol='features',即保存实际特征的列的名称;根据您显示的模式,您的数据框中不存在这样的列,因此是预期的错误。

尚不清楚哪一列包含数据框中的特征 - get_featuresget_idf_feature(它们在您显示的示例中看起来相同);假设它是 get_idf_feature,您应该将 LDA 调用更改为:

lda = LDA(featuresCol='get_idf_feature', k=10, maxIter=10)

Spark(包括 pyspark)ML API 与 scikit-learn 和类似框架相比具有截然不同的逻辑;其中一个区别确实是这些特征必须全部在相应数据框的单个列中。对于这个想法的一般演示,请参阅 中自己的答案(它是关于 K-Means,但逻辑是相同的)。