Re-fitting a saved scikit-learn model without some features not used - "ValueError: A given column is not a column of the dataframe"

Question

I need to re-fit a scikit-learn pipeline on a smaller dataset that lacks some features the model does not actually use.

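(What actually happens is that I save the model with joblib and load it in another file, where I need to re-fit it because it contains some custom transformers I wrote, and adding all the features there would be painful because it is a different kind of model. That is not important, though, because the same error occurs if I re-fit the model in the same file where I first trained it, before saving it.)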

This is my custom transformer:

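from sklearn.base import BaseEstimator, TransformerMixin


class TransformAdoptionFeatures(BaseEstimator, TransformerMixin):
    """Keep columns whose names contain '_munic', '_adj' or '_port',
    except those that also contain 'tot_cumul'."""

    def __init__(self):
        pass

    def fit(self, X, y=None):
        # Stateless: the columns to keep are derived from X at transform time.
        return self

    def transform(self, X):
        adoption_features = X.columns
        feats_munic = [feat for feat in adoption_features if '_munic' in feat]
        feats_adj_neigh = [feat for feat in adoption_features
                           if '_adj' in feat]
        feats_port = [feat for feat in adoption_features if '_port' in feat]

        feats_to_keep_all = feats_munic + feats_adj_neigh + feats_port
        # Drop the cumulative totals from the selected features.
        feats_to_keep = [feat for feat in feats_to_keep_all
                         if 'tot_cumul' not in feat]

        return X[feats_to_keep]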

This is my pipeline:

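from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Preprocessing: select the adoption features, then standardize them.
full_pipeline = Pipeline([
    ('transformer', TransformAdoptionFeatures()),
    ('scaler', StandardScaler())
])

# Full model: preprocessing followed by the estimator (ml_model, described below).
model = Pipeline([
    ("preparation", full_pipeline),
    ("regressor", ml_model)
])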

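where ml_model is a scikit-learn machine learning model. Both full_pipeline and ml_model had already been fitted when model was saved. (In the real model, the actual full_pipeline is an intermediate ColumnTransformer step, because I need different transformers for different columns, but for brevity I copied only the important one.)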

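The problem: I reduced the number of features in the dataset I had used to fit everything, dropping some features that TransformAdoptionFeatures() does not consider (they never end up among the features to keep). I then tried to re-fit the model on the new dataset with the reduced features, but I got this error: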

Traceback (most recent call last):

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\pandas\core\indexes\base.py", line 2889, in get_loc
    return self._engine.get_loc(casted_key)

  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc

  File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item

  File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item

KeyError: 'tot_cumul_adoption_pr_y_munic'


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\utils\__init__.py", line 447, in _get_column_indices
    col_idx = all_columns.get_loc(col)

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
    raise KeyError(key) from err

KeyError: 'tot_cumul_adoption_pr_y_munic'


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "C:\Users\giaco\sbp-abm\municipalities_abm\test.py", line 15, in <module>
    modelSBP = model.SBPAdoption(initial_year=start_year)

  File "C:\Users\giaco\sbp-abm\municipalities_abm\municipalities_abm\model.py", line 103, in __init__
    self._upload_ml_models(ml_clsf_folder, ml_regr_folder)

  File "C:\Users\giaco\sbp-abm\municipalities_abm\municipalities_abm\model.py", line 183, in _upload_ml_models
    self._ml_clsf.fit(clsf_dataset.drop('adoption_in_year', axis=1),

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\pipeline.py", line 330, in fit
    Xt = self._fit(X, y, **fit_params_steps)

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\pipeline.py", line 292, in _fit
    X, fitted_transformer = fit_transform_one_cached(

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\joblib\memory.py", line 352, in __call__
    return self.func(*args, **kwargs)

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\pipeline.py", line 740, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\compose\_column_transformer.py", line 529, in fit_transform
    self._validate_remainder(X)

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\compose\_column_transformer.py", line 327, in _validate_remainder
    cols.extend(_get_column_indices(X, columns))

  File "C:\Users\giaco\anaconda3\envs\mesa_geo_ml\lib\site-packages\sklearn\utils\__init__.py", line 454, in _get_column_indices
    raise ValueError(

ValueError: A given column is not a column of the dataframe

I don't understand what causes this error; I thought scikit-learn did not store the names of the columns I pass in.

Answer 1

I found my mistake. It was in how I used ColumnTransformer, which is also the only place where column names are passed in.

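The mistake was simple: I had not updated the lists of columns that each transformer is applied to, so they still contained the names of the features I had removed.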
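For anyone hitting the same error, here is a minimal sketch of the situation and the fix. It is not the original model: the data, the single "adoption" transformer, the helper build_model and the use of LinearRegression are all assumptions made for illustration. Only the column name tot_cumul_adoption_pr_y_munic comes from the traceback. The sketch also assumes the column lists are plain lists of strings.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data; only 'tot_cumul_adoption_pr_y_munic' appears in the question.
full_df = pd.DataFrame({
    "adoption_pr_y_munic": [0.1, 0.4, 0.2, 0.9],
    "adoption_pr_y_port": [1.0, 2.0, 3.0, 4.0],
    "tot_cumul_adoption_pr_y_munic": [5.0, 6.0, 7.0, 8.0],
})
y = [1.0, 2.0, 3.0, 4.0]


def build_model(columns):
    """Hypothetical helper: a ColumnTransformer given an explicit column list."""
    preparation = ColumnTransformer([
        ("adoption", StandardScaler(), columns),
    ])
    return Pipeline([
        ("preparation", preparation),
        ("regressor", LinearRegression()),
    ])


# First fit on the full dataset: every listed column exists.
model = build_model(list(full_df.columns))
model.fit(full_df, y)

# Re-fitting on a reduced dataset fails, because the ColumnTransformer
# still holds the name of the dropped column in its column list.
reduced_df = full_df.drop(columns=["tot_cumul_adoption_pr_y_munic"])
try:
    model.fit(reduced_df, y)
    refit_error = None
except (KeyError, ValueError) as err:
    refit_error = err

# Fix: rewrite each transformer's column list so that it only names
# columns that are present in the reduced dataset, then re-fit.
column_transformer = model.named_steps["preparation"]
column_transformer.transformers = [
    (name, transformer, [col for col in cols if col in reduced_df.columns])
    for name, transformer, cols in column_transformer.transformers
]
model.fit(reduced_df, y)
predictions = model.predict(reduced_df)
```

The same idea applies to a model loaded with joblib: the column lists are stored inside the ColumnTransformer, so they have to be updated (or the ColumnTransformer rebuilt) before fitting on data that no longer has those columns.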

python

pipeline

training-data

scikit-learn