StackingClassifier raises exception: 'numpy.ndarray' object has no attribute 'columns'

Question

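I am trying to train a StackingClassifier in sklearn, but I keep running into this error, in which the fit method seems to complain that I passed it a numpy array. As far as I know, that is how all fit methods in sklearn work. I read and followed the example in the documentation and extended it into a more complex, more comprehensive pipeline that handles categorical, ordinal, scalar, and text data.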

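I apologize for the lengthy code, but I felt it was necessary to provide a complete reproducible example. Breaking the pipeline down into its component estimators and testing them individually raises no exceptions, so I think the error somehow comes from the combined estimator.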

Select features

categorical_data = [
    "race",
    "gender",
    "admission_type_id",
    "discharge_disposition_id",
    "admission_source_id",
    "insulin",
    "diabetesMed",
    "change",
    "payer_code",
    "A1Cresult",
    "metformin",
    "repaglinide",
    "nateglinide",
    "chlorpropamide",
    "glimepiride",
    "glipizide",
    "glyburide",
    "tolbutamide",
    "pioglitazone",
    "rosiglitazone",
    "acarbose",
    "miglitol",
    "tolazamide",
    "glyburide.metformin",
    "glipizide.metformin",
]

ordinal_data = [
    "age"
]

scalar_data = [
    "num_medications",
    "time_in_hospital",
    "num_lab_procedures",
    "num_procedures",
    "number_outpatient",
    "number_emergency",
    "number_inpatient",
    "number_diagnoses",
]

text_data = [
    "diag_1_desc",
    "diag_2_desc",
    "diag_3_desc"
]

Create the column transformers

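# Imports assumed by the snippets in this question
import pandas as pd
from sklearn import compose, impute, preprocessing
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

impute_trans = compose.make_column_transformer(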
    (
        impute.SimpleImputer(
            strategy="constant",
            fill_value="missing"
        ),
        categorical_data
    )
)

encode_trans = compose.make_column_transformer(
    (
        preprocessing.OneHotEncoder(
            sparse=False,
            handle_unknown="ignore"
        ),
        categorical_data
    ),
    (
        preprocessing.OrdinalEncoder(),
        ordinal_data
    )
)

scalar_trans = compose.make_column_transformer(
    (preprocessing.StandardScaler(), scalar_data),
)

text_trans = compose.make_column_transformer(
    (TfidfVectorizer(ngram_range=(1,2)), "diag_1_desc"),
    (TfidfVectorizer(ngram_range=(1,2)), "diag_2_desc"),
    (TfidfVectorizer(ngram_range=(1,2)), "diag_3_desc"),
)

Create the estimators

cat_pre_pipe = make_pipeline(impute_trans, encode_trans)

logreg = LogisticRegression(
    solver="saga",
    penalty="elasticnet",
    l1_ratio=0.5,
    max_iter=1000
)

text_pipe = make_pipeline(text_trans, logreg)
scalar_pipe = make_pipeline(scalar_trans, logreg)
cat_pipe = make_pipeline(cat_pre_pipe, logreg)

estimators = [
    ("cat", cat_pipe),
    ("text", text_pipe),
    ("scalar", scalar_pipe)
]

Create the stacking classifier

stack_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=logreg
)

diabetes_data = pd.read_csv("8k_diabetes.csv", delimiter=',')

x_train, x_test, y_train, y_test = train_test_split(
    pd.concat([
        # preprocess_dataframe is a helper that is not shown in this post
        preprocess_dataframe(diabetes_data[text_data]),
        diabetes_data[categorical_data + scalar_data]
    ], axis=1),
    diabetes_data["readmitted"].astype(int)
)

stack_clf.fit(x_train, y_train)

Full stack trace

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/utils/__init__.py:409, in _get_column_indices(X, key)
    408 try:
--> 409     all_columns = X.columns
    410 except AttributeError:

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Input In [19], in <cell line: 1>()
----> 1 stack_clf.fit(x_train, y_train)

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py:488, in StackingClassifier.fit(self, X, y, sample_weight)
    486 self._le = LabelEncoder().fit(y)
    487 self.classes_ = self._le.classes_
--> 488 return super().fit(X, self._le.transform(y), sample_weight)

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py:158, in _BaseStacking.fit(self, X, y, sample_weight)
    153 stack_method = [self.stack_method] * len(all_estimators)
    155 # Fit the base estimators on the whole training data. Those
    156 # base estimators will be used in transform, predict, and
    157 # predict_proba. They are exposed publicly.
--> 158 self.estimators_ = Parallel(n_jobs=self.n_jobs)(
    159     delayed(_fit_single_estimator)(clone(est), X, y, sample_weight)
    160     for est in all_estimators
    161     if est != "drop"
    162 )
    164 self.named_estimators_ = Bunch()
    165 est_fitted_idx = 0

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:1043, in Parallel.__call__(self, iterable)
   1034 try:
   1035     # Only set self._iterating to True if at least a batch
   1036     # was dispatched. In particular this covers the edge
   (...)
   1040     # was very quick and its callback already dispatched all the
   1041     # remaining jobs.
   1042     self._iterating = False
-> 1043     if self.dispatch_one_batch(iterator):
   1044         self._iterating = self._original_iterator is not None
   1046     while self.dispatch_one_batch(iterator):

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:861, in Parallel.dispatch_one_batch(self, iterator)
    859     return False
    860 else:
--> 861     self._dispatch(tasks)
    862     return True

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:779, in Parallel._dispatch(self, batch)
    777 with self._lock:
    778     job_idx = len(self._jobs)
--> 779     job = self._backend.apply_async(batch, callback=cb)
    780     # A job can complete so quickly than its callback is
    781     # called before we get here, causing self._jobs to
    782     # grow. To ensure correct results ordering, .insert is
    783     # used (rather than .append) in the following line
    784     self._jobs.insert(job_idx, job)

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/_parallel_backends.py:208, in SequentialBackend.apply_async(self, func, callback)
    206 def apply_async(self, func, callback=None):
    207     """Schedule a func to be run"""
--> 208     result = ImmediateResult(func)
    209     if callback:
    210         callback(result)

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/_parallel_backends.py:572, in ImmediateResult.__init__(self, batch)
    569 def __init__(self, batch):
    570     # Don't delay the application, to avoid keeping the input
    571     # arguments in memory
--> 572     self.results = batch()

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:262, in BatchedCalls.__call__(self)
    258 def __call__(self):
    259     # Set the default nested backend to self._backend but do not set the
    260     # change the default number of processes to -1
    261     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262         return [func(*args, **kwargs)
    263                 for func, args, kwargs in self.items]

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:262, in <listcomp>(.0)
    258 def __call__(self):
    259     # Set the default nested backend to self._backend but do not set the
    260     # change the default number of processes to -1
    261     with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262         return [func(*args, **kwargs)
    263                 for func, args, kwargs in self.items]

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/utils/fixes.py:216, in _FuncWrapper.__call__(self, *args, **kwargs)
    214 def __call__(self, *args, **kwargs):
    215     with config_context(**self.config):
--> 216         return self.function(*args, **kwargs)

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/ensemble/_base.py:42, in _fit_single_estimator(estimator, X, y, sample_weight, message_clsname, message)
     40 else:
     41     with _print_elapsed_time(message_clsname, message):
---> 42         estimator.fit(X, y)
     43 return estimator

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/pipeline.py:390, in Pipeline.fit(self, X, y, **fit_params)
    364 """Fit the model.
    365 
    366 Fit all the transformers one after the other and transform the
   (...)
    387     Pipeline with fitted steps.
    388 """
    389 fit_params_steps = self._check_fit_params(**fit_params)
--> 390 Xt = self._fit(X, y, **fit_params_steps)
    391 with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
    392     if self._final_estimator != "passthrough":

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/pipeline.py:348, in Pipeline._fit(self, X, y, **fit_params_steps)
    346     cloned_transformer = clone(transformer)
    347 # Fit or load from cache the current transformer
--> 348 X, fitted_transformer = fit_transform_one_cached(
    349     cloned_transformer,
    350     X,
    351     y,
    352     None,
    353     message_clsname="Pipeline",
    354     message=self._log_message(step_idx),
    355     **fit_params_steps[name],
    356 )
    357 # Replace the transformer of the step with the fitted
    358 # transformer. This is necessary when loading the transformer
    359 # from the cache.
    360 self.steps[step_idx] = (name, fitted_transformer)

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/memory.py:349, in NotMemorizedFunc.__call__(self, *args, **kwargs)
    348 def __call__(self, *args, **kwargs):
--> 349     return self.func(*args, **kwargs)

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/pipeline.py:893, in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    891 with _print_elapsed_time(message_clsname, message):
    892     if hasattr(transformer, "fit_transform"):
--> 893         res = transformer.fit_transform(X, y, **fit_params)
    894     else:
    895         res = transformer.fit(X, y, **fit_params).transform(X)

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/pipeline.py:434, in Pipeline.fit_transform(self, X, y, **fit_params)
    432 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
    433 if hasattr(last_step, "fit_transform"):
--> 434     return last_step.fit_transform(Xt, y, **fit_params_last_step)
    435 else:
    436     return last_step.fit(Xt, y, **fit_params_last_step).transform(Xt)

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py:672, in ColumnTransformer.fit_transform(self, X, y)
    670 self._check_n_features(X, reset=True)
    671 self._validate_transformers()
--> 672 self._validate_column_callables(X)
    673 self._validate_remainder(X)
    675 result = self._fit_transform(X, y, _fit_transform_one)

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py:352, in ColumnTransformer._validate_column_callables(self, X)
    350         columns = columns(X)
    351     all_columns.append(columns)
--> 352     transformer_to_input_indices[name] = _get_column_indices(X, columns)
    354 self._columns = all_columns
    355 self._transformer_to_input_indices = transformer_to_input_indices

File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/utils/__init__.py:411, in _get_column_indices(X, key)
    409     all_columns = X.columns
    410 except AttributeError:
--> 411     raise ValueError(
    412         "Specifying the columns using strings is only "
    413         "supported for pandas DataFrames"
    414     )
    415 if isinstance(key, str):
    416     columns = [key]

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Full pipeline diagram

Answer 1

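Your categorical pipeline chains two column transformers together. After the first one, the output is a numpy array, so the second one cannot select columns by name as you have asked it to. Note that the final error message here is the more informative one: ValueError: Specifying the columns using strings is only supported for pandas DataFrames.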
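A minimal sketch of that failure mode may help. It uses a two-column toy DataFrame (the values are hypothetical, not taken from the question's CSV) and assumes scikit-learn's default output configuration, in which a ColumnTransformer returns an array rather than a DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for the real data (hypothetical values).
df = pd.DataFrame(
    {
        "race": np.array(["A", np.nan, "B"], dtype=object),
        "gender": np.array(["F", "M", np.nan], dtype=object),
    }
)
cols = ["race", "gender"]

impute_trans = make_column_transformer(
    (SimpleImputer(strategy="constant", fill_value="missing"), cols)
)
encode_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), cols)
)

# The first transformer returns a plain array, so the column names are gone.
imputed = impute_trans.fit_transform(df)
print(type(imputed).__name__)

# The second transformer selects columns by name, which an array cannot do.
try:
    make_pipeline(impute_trans, encode_trans).fit(df)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

The exact wording of the message differs slightly between scikit-learn versions, but in each case it is a ValueError about selecting columns with strings.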

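For this reason, I suggest using one column transformer that contains separate pipelines, rather than one pipeline that contains multiple column transformers.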
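One way this restructuring might look is sketched below. The toy data, the shortened column lists, and the names cat_steps, cat_pre, and num_pre are assumptions made for illustration. The LogisticRegression also uses default settings instead of the question's saga/elasticnet configuration, purely to keep the sketch fast:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import StackingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the diabetes CSV.
rng = np.random.default_rng(0)
n = 60
race = rng.choice(["A", "B", "C"], size=n).astype(object)
race[::7] = np.nan  # sprinkle in some missing values
X = pd.DataFrame(
    {
        "race": race,
        "gender": rng.choice(["F", "M"], size=n).astype(object),
        "num_medications": rng.integers(1, 30, size=n),
        "time_in_hospital": rng.integers(1, 14, size=n),
    }
)
y = np.tile([0, 1], n // 2)  # balanced binary target

cat_cols = ["race", "gender"]
num_cols = ["num_medications", "time_in_hospital"]

# Impute and encode inside ONE pipeline that a single ColumnTransformer
# applies to the named columns, so names are only looked up on the DataFrame.
cat_steps = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)
cat_pre = make_column_transformer((cat_steps, cat_cols))
num_pre = make_column_transformer((StandardScaler(), num_cols))

# Simplified classifier; the question uses solver="saga" with elasticnet.
logreg = LogisticRegression(max_iter=1000)

stack_clf = StackingClassifier(
    estimators=[
        ("cat", make_pipeline(cat_pre, logreg)),
        ("scalar", make_pipeline(num_pre, logreg)),
    ],
    final_estimator=logreg,
)
stack_clf.fit(X, y)
predictions = stack_clf.predict(X)
print(predictions[:10])
```

Each base estimator still receives the original DataFrame, so every ColumnTransformer sees real column names. Separately, the question's training frame is built from text_data, categorical_data, and scalar_data only, so the "age" column that encode_trans selects would probably also need to be included in it.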

Tags: python, numpy, scikit-learn