Change PySpark StringIndexer input_col param when wrapped in a Pipeline object

Question

I am building a Pipeline object that uses a StringIndexer to encode my categorical columns.

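from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# df is the DataFrame that holds the FirstName column
indexers = [StringIndexer(inputCol='FirstName',
                          outputCol='FirstName_new',
                          handleInvalid='keep',
                          stringOrderType='frequencyDesc').fit(df)]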

pipeline = Pipeline(stages=indexers)

pipeline.write().overwrite().save(path)

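I want to use the same pipeline object on a different column (I have a specific use case that requires it). Is there a way to change the input_col parameter?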

Answer 1

You can change the input column name with the setInputCol method.

indexers = [StringIndexer(inputCol='FirstName',
                          outputCol='FirstName_new',
                          handleInvalid='keep',
                          stringOrderType='frequencyDesc')]

pipeline = Pipeline(stages=indexers)

>>> print(pipeline.getStages()[0].getInputCol())
FirstName

pipeline.getStages()[0].setInputCol('test')

>>> print(pipeline.getStages()[0].getInputCol())
test

Note that you should not call fit(df) on the stage inside the pipeline. Fit the whole pipeline to the data instead, e.g. pipeline.fit(df).

apache-spark

pyspark

apache-spark-ml

apache-spark-mllib