feature-engine: cross-validation gives error when wrapping OneHotEncoder in SklearnTransformerWrapper

Problem

I am using the feature-engine library, and am finding that when I create an sklearn Pipeline that uses the SklearnTransformerWrapper to wrap a OneHotEncoder, I get the following error when trying to run cross-validation:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
...
9 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.

Below are more details about the failures:
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

If I do things the "old way" with an sklearn ColumnTransformer, I don't get the error. I also don't get the error if I either: A) score without cross-validation, or B) don't use the categorical features (i.e., remove the one-hot encoding).

Is this a problem with SklearnTransformerWrapper, or am I using it incorrectly?

Code

Here is the Pipeline setup with SklearnTransformerWrapper that fails. It runs successfully if we don't use the categorical features, or if we don't do cross-validation (see the comments in the code):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression

from feature_engine.wrappers import SklearnTransformerWrapper
from feature_engine.selection import DropFeatures


pipeline_new = Pipeline(steps=[
    ("scale_b_c", SklearnTransformerWrapper(
            transformer=StandardScaler(), 
            variables=["b", "c"]
        )
    ),
    
    # Comment out this step for cross-validation to not fail
    ("encode_a_d", SklearnTransformerWrapper(
            transformer=OneHotEncoder(drop="first", sparse=False), 
            variables=["a", "d"]
        )
    ),
    
    ("cleanup", DropFeatures(["a", "d"])),
    ("model", LinearRegression())
])

# Defined later (putting main example up front)
# Set cv to False to successfully score entire training set
do_test(df, pipeline_new, cv=True)

Here is the "old-style" pipeline using ColumnTransformer; it works fine:

from sklearn.compose import ColumnTransformer


pipeline_old = Pipeline(steps=[
    (
        "xform", ColumnTransformer([
            ("cat", OneHotEncoder(drop="first"), ["a", "d"]),
            ("num", StandardScaler(), ["b", "c"])
        ])
    ),
    ("model", LinearRegression())
])

# Defined later (putting main example up front)
do_test(df, pipeline_old, cv=True)

Supporting code: implementation of the do_test() test function:

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# do_test() implementation
def do_test(df, pipeline, cv=True):
    X = df.drop(columns=["y"])
    y = df[["y"]]
       
    if cv:
        return cross_val_score(pipeline, X, y, scoring="neg_mean_squared_error", cv=10)
    else:
        pipeline.fit(X, y)
        y_pred = pipeline.predict(X)        
        return mean_squared_error(y, y_pred)

Supporting code: sample data creation.

import pandas as pd
import numpy as np

# Create sample data
n = 20000
df = pd.DataFrame({
    "a": [["alpha", "beta", "gamma", "delta"][np.random.randint(4)] for i in range(n)],
    "b": [np.random.random() * 100 for i in range(n)],
    "c": [np.random.random() * 200 for i in range(n)],
    "d": [["east", "west"][np.random.randint(2)] for i in range(n)],
})

def make_y(x):
    add_1 = 100 if x.a in ["alpha", "beta"] else 200
    add_2 = 100 if x.d in ["east"] else 300

    return 2 * x.b + 3 * x.c + 2 * add_1 + 5 * add_2 + np.random.normal(10)

df["y"] = df.apply(make_y, axis=1)

Note: I did not do a train/test split, in order to keep the question shorter.

It is easy to verify that the SklearnTransformerWrapper in the pipeline's "encode_a_d" step produces NaNs during cross-validation:

from sklearn.model_selection import KFold

X = df.drop(columns=["y"])
kf = KFold(n_splits=10)

for train_index, test_index in kf.split(X):
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    X_train_pipe = pipeline_new["encode_a_d"].fit_transform(pipeline_new["scale_b_c"].fit_transform(X_train))
    X_test_pipe = pipeline_new["encode_a_d"].fit_transform(pipeline_new["scale_b_c"].fit_transform(X_test))
    print(X_train_pipe.isnull().any().any(), X_test_pipe.isnull().any().any())

It seems to double the number of rows, with NaN in the features ['b', 'c'] for the rows where the one-hot-encoded features built from ['a', 'd'] have their usual values, and vice versa. As for why that happens, I don't know; it could be feature-engine's fault, but in my experience it is more likely some mischief on the part of cross_val_score.
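To see that pattern more concretely, one fold can be inspected directly. This is a quick sketch reusing kf, X, and pipeline_new from the snippet above; it uses the train fold of the first split, whose index starts at 2000 rather than 0:

train_index, _ = next(kf.split(X))
X_train = X.loc[train_index]
X_train_pipe = pipeline_new["encode_a_d"].fit_transform(
    pipeline_new["scale_b_c"].fit_transform(X_train)
)
# Compare row counts before and after, and count NaNs per column
print(X_train.shape, X_train_pipe.shape)
print(X_train_pipe.isnull().sum())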

The output described by @AlwaysRightNeverLeft points to an index problem: when cross-validating, the dataframes have non-standard indexes, and when SklearnTransformerWrapper merges the one-hot-encoded array back into the original data, it performs an "outer join".
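A minimal pandas sketch of that mechanism (the column name a_beta is made up for illustration; this is not feature-engine's actual code): if the encoded array is wrapped in a new DataFrame with a fresh RangeIndex while the CV fold keeps its original index, concatenating the two along the columns aligns on the index, which behaves like an outer join and produces the extra rows and complementary NaNs:

import pandas as pd

fold = pd.DataFrame({"b": [1.0, 2.0, 3.0]}, index=[2000, 2001, 2002])  # CV fold keeps its original index
encoded = pd.DataFrame({"a_beta": [0, 1, 0]})                          # fresh RangeIndex 0..2

# axis=1 concatenation aligns on the index, i.e. an outer join
joined = pd.concat([fold, encoded], axis=1)
print(joined)  # 6 rows: 'b' is NaN where 'a_beta' is filled, and vice versa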