Why Does StackingClassifier Raise an Error When the Component Classifier Does Not?
I am using StackingClassifier to combine several model pipelines to predict readmission on the UCI diabetes dataset. Each pipeline runs on its own, but I keep running into problems when I try to combine them. I would like to know why the standalone text classifier runs while the stacked classifier does not, and how I can fix it.
Here is the part that raises the error:
stack_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression()
)
x_train, x_test, y_train, y_test = train_test_split(
    pd.concat(
        [
            diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
            diabetes_data[categorical_data + ordinal_data + scalar_data],
        ],
        axis=1,
    ),
    diabetes_data["readmitted"]
)
# This line throws the error in the fit function
stack_clf.fit(x_train, y_train).score(x_test, y_test)
ValueError: could not convert string to float: 'bronchitis specified acute chronic'
Now an example of a component classifier that works:
x_train, x_test, y_train, y_test = train_test_split(
    diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
    diabetes_data["readmitted"]
)
text_pipe.fit(x_train, y_train).score(x_test, y_test)
0.5935
Because it is not clear to me where in the pipeline the error originates, I have included the full minimal reproducible example below.
Imports
import pandas as pd
from sklearn import compose, preprocessing
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
Select columns
text_data = [
    "diag_1_desc",
    "diag_2_desc",
    "diag_3_desc"
]
scalar_data = [
    "num_medications",
    "time_in_hospital",
    "num_lab_procedures",
    "num_procedures",
    "number_outpatient",
    "number_emergency",
    "number_inpatient",
    "number_diagnoses",
]
ordinal_data = [
    "age"
]
categorical_data = [
    "race",
    "gender",
    "admission_type_id",
    "discharge_disposition_id",
    "admission_source_id",
    "insulin",
    "diabetesMed",
    "change",
    "A1Cresult",
    "metformin",
    "repaglinide",
    "nateglinide",
    "chlorpropamide",
    "glimepiride",
    "glipizide",
    "glyburide",
    "tolbutamide",
    "pioglitazone",
    "rosiglitazone",
    "acarbose",
    "miglitol",
    "tolazamide",
    "glyburide.metformin",
    "glipizide.metformin",
]
Create the logistic regression classifier
logreg = LogisticRegression(
    solver="saga",
    penalty="elasticnet",
    l1_ratio=0.5,
    max_iter=1000
)
Create the column transformers
text_trans = compose.make_column_transformer(
    (TfidfVectorizer(ngram_range=(1, 2)), "diag_1_desc"),
    (TfidfVectorizer(ngram_range=(1, 2)), "diag_2_desc"),
    (TfidfVectorizer(ngram_range=(1, 2)), "diag_3_desc"),
    remainder="passthrough",
)
scalar_trans = compose.make_column_transformer(
    (
        preprocessing.StandardScaler(),
        scalar_data
    ),
    remainder="passthrough",
)
cat_trans = compose.make_column_transformer(
    (
        preprocessing.OneHotEncoder(
            sparse=False,  # renamed to sparse_output in scikit-learn 1.2
            handle_unknown="ignore"
        ),
        categorical_data
    ),
    (
        preprocessing.OrdinalEncoder(),
        ordinal_data
    ),
    remainder="passthrough",
)
Create the pipeline estimators
text_pipe = make_pipeline(text_trans, logreg)
scalar_pipe = make_pipeline(scalar_trans, logreg)
cat_pipe = make_pipeline(cat_trans, logreg)
estimators = [
    ("cat", cat_pipe),
    ("text", text_pipe),
    ("scalar", scalar_pipe)
]
Create and fit the stacking classifier
stack_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression()
)
x_train, x_test, y_train, y_test = train_test_split(
    pd.concat(
        [
            diabetes_data[text_data].apply(lambda x: preprocess_series(x)),
            diabetes_data[categorical_data + ordinal_data + scalar_data],
        ],
        axis=1,
    ),
    diabetes_data["readmitted"]
)
stack_clf.fit(x_train, y_train).score(x_test, y_test)
ValueError: could not convert string to float: 'bronchitis specified acute chronic'
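My pipelines also rely on two helper functions, which I use to preprocess the text data by removing punctuation and stop words.
Helper functions
import re
import nltk

# Build these once instead of once per word.
STOP_WORDS = set(nltk.corpus.stopwords.words('english'))
LEMMATIZER = nltk.stem.WordNetLemmatizer()

def preprocess_text(text):
    # Keep letters only, lowercase, drop stop words and one-letter tokens, lemmatize.
    try:
        text = re.sub('[^a-zA-Z]', ' ', text)
        words = [word for word in text.lower().split() if word not in STOP_WORDS]
        words = [LEMMATIZER.lemmatize(word) for word in words if len(word) > 1]
        return ' '.join(words)
    except TypeError:
        # Non-string input (for example NaN) becomes an empty document.
        return ''

def preprocess_series(series):
    # Index by position and keep the original index so pd.concat stays aligned.
    texts = [preprocess_text(series.iloc[i]) for i in range(len(series))]
    return pd.Series(texts, index=series.index)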
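It looks like not all of your component pipelines work; only the text pipeline does. Your other pipelines use column transformers with remainder='passthrough', which means the text columns pass through them untransformed, and the logistic regression then balks at the raw strings.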
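One way to check this is sketched below on a made-up two-column frame rather than the real diabetes data. It fits a scalar pipeline on a frame that still contains a text column, once with remainder="passthrough" and once with remainder="drop" (the ColumnTransformer default). The toy values, the make_scalar_pipe helper, and the plain LogisticRegression() settings are assumptions for illustration, not taken from your code. Your text pipeline probably only runs because you fit it on the text columns alone; on the combined frame its passthrough would also forward any string-valued categorical columns, so dropping the remainder in all three transformers seems the safer choice.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the combined training frame (hypothetical values).
X = pd.DataFrame({
    "diag_1_desc": ["bronchitis acute", "diabetes mellitus",
                    "heart failure", "bronchitis chronic"] * 10,
    "num_medications": [1.0, 5.0, 3.0, 2.0] * 10,
})
y = pd.Series([0, 1, 1, 0] * 10)


def make_scalar_pipe(remainder):
    """Scale the numeric column; `remainder` decides what happens to the rest."""
    trans = make_column_transformer(
        (StandardScaler(), ["num_medications"]),
        remainder=remainder,
    )
    return make_pipeline(trans, LogisticRegression())


# With passthrough, the raw text column reaches LogisticRegression and fit fails.
try:
    make_scalar_pipe("passthrough").fit(X, y)
    passthrough_failed = False
except ValueError as exc:
    passthrough_failed = True
    print("passthrough:", exc)

# With drop, every pipeline only sees the columns it knows how to transform.
text_pipe = make_pipeline(
    make_column_transformer((TfidfVectorizer(), "diag_1_desc"), remainder="drop"),
    LogisticRegression(),
)
stack_clf = StackingClassifier(
    estimators=[("text", text_pipe), ("scalar", make_scalar_pipe("drop"))],
    final_estimator=LogisticRegression(),
)
stack_clf.fit(X, y)
accuracy = stack_clf.score(X, y)
print("stacked accuracy on the toy frame:", accuracy)
```

If this behaves the same way on your data, changing remainder="passthrough" to remainder="drop" in text_trans, scalar_trans, and cat_trans should let stack_clf.fit run on the concatenated frame.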