Why Does StackingClassifier Raise Error When Component Classifier Does Not?

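I am using StackingClassifier to combine several model pipelines to predict readmission on the UCI diabetes dataset. Each pipeline runs on its own, but I keep running into problems when I try to combine them. I would like to know why the standalone text classifier runs while the stacked classifier does not, and how I can fix it.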

Here is the part that raises the error:

stack_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression()
)

x_train, x_test, y_train, y_test = train_test_split(
    pd.concat([
        diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
        diabetes_data[categorical_data+ordinal_data+scalar_data]
    ], axis=1
    ),
    diabetes_data["readmitted"]                                                
)

# This line throws the error in the fit function
stack_clf.fit(x_train, y_train).score(x_test, y_test)

ValueError: could not convert string to float: 'bronchitis specified acute chronic'

Now an example of a component classifier that works:

x_train, x_test, y_train, y_test = train_test_split(
    diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
    diabetes_data["readmitted"]
)

text_pipe.fit(x_train, y_train).score(x_test, y_test)

0.5935

Because it is not clear to me where in the pipeline the error originates, I have provided the full minimal reproducible example below.

Imports

import re

import nltk
import pandas as pd
from sklearn import compose, preprocessing
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

Select columns

text_data = [
    "diag_1_desc",
    "diag_2_desc",
    "diag_3_desc"
]

scalar_data = [
    "num_medications",
    "time_in_hospital",
    "num_lab_procedures",
    "num_procedures",
    "number_outpatient",
    "number_emergency",
    "number_inpatient",
    "number_diagnoses",
]

ordinal_data = [
    "age"
]

categorical_data = [
    "race",
    "gender",
    "admission_type_id",
    "discharge_disposition_id",
    "admission_source_id",
    "insulin",
    "diabetesMed",
    "change",
    "A1Cresult",
    "metformin",
    "repaglinide",
    "nateglinide",
    "chlorpropamide",
    "glimepiride",
    "glipizide",
    "glyburide",
    "tolbutamide",
    "pioglitazone",
    "rosiglitazone",
    "acarbose",
    "miglitol",
    "tolazamide",
    "glyburide.metformin",
    "glipizide.metformin",    
]

Create the logistic regression classifier

logreg = LogisticRegression(
    solver = "saga",
    penalty="elasticnet",
    l1_ratio=0.5,
    max_iter=1000
)

Create the column transformers

text_trans = compose.make_column_transformer(
    (TfidfVectorizer(ngram_range=(1,2)), "diag_1_desc"),
    (TfidfVectorizer(ngram_range=(1,2)), "diag_2_desc"),
    (TfidfVectorizer(ngram_range=(1,2)), "diag_3_desc"),
    remainder="passthrough",
)

scalar_trans = compose.make_column_transformer(
    (
        preprocessing.StandardScaler(),
        scalar_data
    ),
    remainder="passthrough",
)

cat_trans = compose.make_column_transformer(
    (
        preprocessing.OneHotEncoder(
            sparse=False,
            handle_unknown="ignore"
        ),
        categorical_data
    ),
    (
        preprocessing.OrdinalEncoder(),
        ordinal_data
    ),
    remainder="passthrough",
)

Create the pipeline estimators

text_pipe = make_pipeline(text_trans, logreg)
scalar_pipe = make_pipeline(scalar_trans, logreg)
cat_pipe = make_pipeline(cat_trans, logreg)

estimators = [
    ("cat", cat_pipe),
    ("text", text_pipe),
    ("scalar", scalar_pipe)
]

Create and fit the stacking classifier

stack_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression()
)

x_train, x_test, y_train, y_test = train_test_split(
    pd.concat([
        diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
        diabetes_data[categorical_data+ordinal_data+scalar_data]
    ], axis=1
    ),
    diabetes_data["readmitted"]                                                
)

stack_clf.fit(x_train, y_train).score(x_test, y_test)

ValueError: could not convert string to float: 'bronchitis specified acute chronic'

My pipelines also depend on two helper functions, which I use to preprocess the text data by removing punctuation and stop words.

Helper functions

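# Build the stop-word set and the lemmatizer once instead of once per word.
STOP_WORDS = set(nltk.corpus.stopwords.words('english'))
LEMMATIZER = nltk.stem.WordNetLemmatizer()

def preprocess_text(text):
    """Strip non-letters, lowercase, drop stop words, and lemmatize."""
    try:
        words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    except TypeError:
        # Non-string values such as NaN become empty documents.
        return ''
    words = [word for word in words if word not in STOP_WORDS]
    return ' '.join(LEMMATIZER.lemmatize(word) for word in words if len(word) > 1)

def preprocess_series(series):
    """Apply preprocess_text to every element, keeping the original index."""
    return series.apply(preprocess_text)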

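It looks like your component pipelines are not all valid; only the text pipeline is. Your other pipelines use column transformers with remainder='passthrough', which means they pass the text columns through untouched, and the logistic regression will balk at those raw strings. That is what the "could not convert string to float" message is reporting.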
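
A minimal sketch of one possible fix, assuming the diagnosis above is right. StackingClassifier fits every estimator on the same full feature frame, so in the stacked setting each column transformer (likely including text_trans, which then also receives the categorical string columns) sees columns it does not transform. Setting remainder="drop" keeps each pipeline to its own columns. The toy frame, its values, and the build_stack helper below are made up for illustration and are not taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the diabetes data (hypothetical values).
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "diag_1_desc": rng.choice(
        ["bronchitis acute", "diabetes mellitus", "heart failure"], n
    ),
    "race": rng.choice(["A", "B", "C"], n),
    "num_medications": rng.integers(1, 30, n).astype(float),
})
y = np.tile([0, 1], n // 2)  # balanced binary target


def build_stack(remainder):
    """Build a three-pipeline stack whose transformers share one remainder setting."""
    text_pipe = make_pipeline(
        make_column_transformer(
            (TfidfVectorizer(), "diag_1_desc"), remainder=remainder
        ),
        LogisticRegression(max_iter=1000),
    )
    cat_pipe = make_pipeline(
        make_column_transformer(
            (OneHotEncoder(handle_unknown="ignore"), ["race"]), remainder=remainder
        ),
        LogisticRegression(max_iter=1000),
    )
    scalar_pipe = make_pipeline(
        make_column_transformer(
            (StandardScaler(), ["num_medications"]), remainder=remainder
        ),
        LogisticRegression(max_iter=1000),
    )
    return StackingClassifier(
        estimators=[("cat", cat_pipe), ("text", text_pipe), ("scalar", scalar_pipe)],
        final_estimator=LogisticRegression(),
    )


# With passthrough, untransformed string columns reach LogisticRegression.
try:
    build_stack("passthrough").fit(toy, y)
    passthrough_failed = False
except ValueError as err:
    passthrough_failed = True
    print("passthrough failed:", err)

# With drop, each pipeline only sees the columns it knows how to encode.
stack = build_stack("drop").fit(toy, y)
drop_score = stack.score(toy, y)
print("drop score:", drop_score)
```

In the question's code the equivalent change would be remainder="drop" in text_trans, scalar_trans, and cat_trans. An alternative, if a base model should see several column groups at once, is to encode all of them inside a single column transformer instead of passing the rest through.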