'MultiOutputClassifier' object is not iterable when creating a Pipeline (Python)

I want to create a pipeline that performs encoding, then scaling, followed by an XGBoost classifier for a multi-label problem. The code block:

# Create a boolean mask for categorical columns
categorical_columns = X.columns[X.dtypes == 'O'].tolist()

# Distinct values of each categorical column, used as the encoder's categories
unique_list = [X[c].unique().tolist() for c in categorical_columns]

# Create a boolean mask for numerical columns
numerical_columns = X.columns[X.dtypes != 'O'].tolist()

#Encoding & Scaling objects
scaler = StandardScaler()
ohe = OneHotEncoder(categories=unique_list, sparse=False)

#Define a pipeline
pipeline  = Pipeline([("ohe_onestep", ohe.fit_transform(X[categorical_columns])),  
         ("scaler_onestep", scaler.fit_transform(X[numerical_columns])),
         MultiOutputClassifier(xgb.XGBClassifier(objective='binary:logistic'))])

# Cross-validate the model
cross_val_scores = cross_val_score(pipeline, X, y, 
                                   scoring='accuracy', cv=5)

But when I run the code, I get this error. The offending line is:

> pipeline = Pipeline([("ohe_onestep", ohe.fit_transform(X[categorical_columns])),

'MultiOutputClassifier' object is not iterable

How can I solve this issue?

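Two things. First, you need to pass the transformers or estimators themselves to the pipeline, not the result of fitting/transforming them (that hands the resulting arrays to the pipeline instead of the transformers, and it will fail). The Pipeline itself does the fitting/transforming. Second, since you apply specific transformations to specific columns, you need a ColumnTransformer.

Putting these together: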

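from sklearn.compose import ColumnTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
import xgboost as xgb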

col_transformers = ColumnTransformer([
                          # name, transformer itself, columns to apply
                          ("scaler_onestep", scaler, numerical_columns),
                          ("ohe_onestep", ohe, categorical_columns)])

model = MultiOutputClassifier(xgb.XGBClassifier(objective="binary:logistic"))

pipeline = Pipeline([("preprocessing", col_transformers), ("XGB", model)])

Now you can do:

cross_val_scores = cross_val_score(pipeline, X, y, 
                                   scoring="accuracy", cv=5)

Also, more generally, you can use make_column_selector with its dtype_include / dtype_exclude options to let it infer the numerical and categorical columns, as exemplified in the scikit-learn documentation.
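
A minimal sketch of the selector approach, in case it helps. The X_demo frame and its column names are made up for illustration and stand in for your X; the step names are reused from the code above. The classifier is left out so that the snippet only shows the preprocessing part.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data standing in for X (made-up values, hypothetical column names)
X_demo = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40.0, 55.5, 80.2, 62.3],
    "city": pd.Series(["A", "B", "A", "C"], dtype=object),
})

col_transformers = ColumnTransformer([
    # name, transformer itself, selector that picks the columns at fit time
    ("scaler_onestep", StandardScaler(),
     make_column_selector(dtype_exclude=object)),
    ("ohe_onestep", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_include=object)),
])

# 2 scaled numerical columns + 3 one-hot columns for "city"
transformed = col_transformers.fit_transform(X_demo)
print(transformed.shape)
```

With the selectors in place you no longer need to build numerical_columns and categorical_columns by hand; the ColumnTransformer can be dropped into the Pipeline exactly as before.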

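Finally, the reason you got that particular error: Pipeline expects a list of tuples. You did pass tuples for the first two items, i.e., the scaler and the ohe, but you did not pass a (<name>, model) tuple as the third item. Instead, you gave it the model directly, and it tried to iterate over it to get the name etc., but failed.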
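
To make that failure mode concrete, here is a small pure-Python sketch. It is not scikit-learn's actual internal code: FakeEstimator and split_steps are hypothetical names, and the exact wording of the error message differs from the one you saw. It only shows that unpacking a bare object as if it were a (name, estimator) pair raises a TypeError.

```python
class FakeEstimator:
    """Hypothetical stand-in for an estimator such as MultiOutputClassifier."""


def split_steps(steps):
    """Unpack every step as a (name, estimator) pair."""
    names, estimators = [], []
    for step in steps:
        # Works for a 2-tuple; raises TypeError for a bare, non-iterable object
        name, estimator = step
        names.append(name)
        estimators.append(estimator)
    return names, estimators


good_steps = [("preprocessing", FakeEstimator()), ("XGB", FakeEstimator())]
bad_steps = [("preprocessing", FakeEstimator()), FakeEstimator()]  # last item is not a tuple

names, _ = split_steps(good_steps)
print(names)

try:
    split_steps(bad_steps)
    error = None
except TypeError as exc:
    error = exc
    print("TypeError:", exc)
```

Wrapping the last item as ("XGB", model) is what turns bad_steps into good_steps.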