feature-engine: cross-validation gives error when wrapping OneHotEncoder in SklearnTransformerWrapper
Question
I'm using the feature-engine library, and am finding that when I create an sklearn Pipeline that uses the SklearnTransformerWrapper to wrap a OneHotEncoder, I get the following error when trying to run cross-validation:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
...
9 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.
Below are more details about the failures:
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
If I do things the "old way" with an sklearn ColumnTransformer, I don't get the error.
I also don't get the error if I either: A) score without cross-validation, or B) leave out the categorical features (i.e., remove the one-hot encoding).
Is this a problem with SklearnTransformerWrapper, or am I using it wrong?
Code
Here is the Pipeline setup using SklearnTransformerWrapper that fails. It runs successfully if we don't use the categorical features, or if we don't do cross-validation (see the comments in the code):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from feature_engine.wrappers import SklearnTransformerWrapper
from feature_engine.selection import DropFeatures

pipeline_new = Pipeline(steps=[
    ("scale_b_c", SklearnTransformerWrapper(
        transformer=StandardScaler(),
        variables=["b", "c"]
    )),
    # Comment out this step for cross-validation to not fail
    ("encode_a_d", SklearnTransformerWrapper(
        transformer=OneHotEncoder(drop="first", sparse=False),
        variables=["a", "d"]
    )),
    ("cleanup", DropFeatures(["a", "d"])),
    ("model", LinearRegression())
])

# Defined later (putting main example up front)
# Set cv to False to successfully score entire training set
do_test(df, pipeline_new, cv=True)
Here is the "old-style" pipeline using ColumnTransformer; it works fine:
from sklearn.compose import ColumnTransformer

pipeline_old = Pipeline(steps=[
    ("xform", ColumnTransformer([
        ("cat", OneHotEncoder(drop="first"), ["a", "d"]),
        ("num", StandardScaler(), ["b", "c"])
    ])),
    ("model", LinearRegression())
])

# Defined later (putting main example up front)
do_test(df, pipeline_old, cv=True)
Supporting code: the do_test() test function implementation:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# do_test() implementation
def do_test(df, pipeline, cv=True):
    X = df.drop(columns=["y"])
    y = df[["y"]]
    if cv:
        return cross_val_score(pipeline, X, y, scoring="neg_mean_squared_error", cv=10)
    else:
        pipeline.fit(X, y)
        y_pred = pipeline.predict(X)
        return mean_squared_error(y, y_pred)
Supporting code: sample data creation:
import pandas as pd
import numpy as np

# Create sample data
n = 20000
df = pd.DataFrame({
    "a": [["alpha", "beta", "gamma", "delta"][np.random.randint(4)] for i in range(n)],
    "b": [np.random.random() * 100 for i in range(n)],
    "c": [np.random.random() * 200 for i in range(n)],
    "d": [["east", "west"][np.random.randint(2)] for i in range(n)],
})

def make_y(x):
    add_1 = 100 if x.a in ["alpha", "beta"] else 200
    add_2 = 100 if x.d in ["east"] else 300
    return 2 * x.b + 3 * x.c + 2 * add_1 + 5 * add_2 + np.random.normal(10)

df["y"] = df.apply(make_y, axis=1)
Note: I'm not doing a train/test split, to keep the question shorter.
Verifying that the "encode_a_d" SklearnTransformerWrapper step of the pipeline produces NaNs during cross-validation is simple:
from sklearn.model_selection import KFold

# Same features as in do_test()
X = df.drop(columns=["y"])

kf = KFold(n_splits=10)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    X_train_pipe = pipeline_new["encode_a_d"].fit_transform(pipeline_new["scale_b_c"].fit_transform(X_train))
    X_test_pipe = pipeline_new["encode_a_d"].fit_transform(pipeline_new["scale_b_c"].fit_transform(X_test))
    print(X_train_pipe.isnull().any().any(), X_test_pipe.isnull().any().any())
It seems to double the number of rows, setting NaN for features ['b', 'c'] on rows where the one-hot-encoded features formed from ['a', 'd'] have their usual values, and vice versa. As to why this happens, I don't know; it could be feature-engine's fault, but in my experience it is more likely some mischief on the part of cross_val_score.
The output described by @AlwaysRightNeverLeft points to an index problem: when cross-validating, the dataframes have non-standard indexes, and when SklearnTransformerWrapper merges the one-hot-encoded array back into the original data, it performs an "outer join".
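As a minimal illustration of that "outer join", independent of feature-engine (plain pandas; the column name a_beta is just a made-up stand-in for an encoded column): concatenating two frames whose rows correspond positionally but whose indexes differ reproduces the extra-rows-plus-NaN pattern described above.

import pandas as pd

# A CV fold keeps its original index labels...
fold = pd.DataFrame({"b": [1.0, 2.0, 3.0]}, index=[2000, 2001, 2002])
# ...while a freshly built frame of encoded columns gets a default 0..n-1 index.
encoded = pd.DataFrame({"a_beta": [0, 1, 0]}, index=[0, 1, 2])

# Column-wise concat aligns on the index (the union of both indexes):
# 6 rows instead of 3, with NaN wherever a label exists in only one frame.
print(pd.concat([fold, encoded], axis=1))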