Wrapper custom class for scikit-learn's IterativeImputer for use with cross_val_score()

Question

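scikit-learn's IterativeImputer can impute missing values in a round-robin fashion. To evaluate its performance against other conventional regressors, one could build a simple pipeline and obtain scoring metrics from cross_val_score. The problem is that, according to the error, IterativeImputer has no 'predict' method: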

AttributeError: 'IterativeImputer' object has no attribute 'predict'
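The error follows from IterativeImputer being a transformer rather than a predictor. As a quick check (a minimal sketch, with variable names chosen only for illustration), one can compare which methods the imputer and a regressor such as BayesianRidge expose:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

imputer = IterativeImputer()
regressor = BayesianRidge()

# The imputer is a transformer: it offers fit/transform but no predict.
imputer_has_predict = hasattr(imputer, 'predict')
imputer_has_transform = hasattr(imputer, 'transform')

# A regressor offers predict, which is what the 'r2' scorer calls.
regressor_has_predict = hasattr(regressor, 'predict')

print(imputer_has_predict, imputer_has_transform, regressor_has_predict)
```

Since the 'r2' scorer calls predict on the estimator it is given, a pipeline that ends in the imputer cannot be scored this way.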

See the minimal example that reproduces it:

# import libraries
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# define scaler, model and pipeline
scaler = StandardScaler() # use any scaler
imputer = IterativeImputer() # with any estimator, default = BayesianRidge()
pipeline = Pipeline(steps=[('s', scaler), ('i', imputer)])

train, test = df.values, df['A'].values 
scores = cross_val_score(pipeline, train, test, cv=10, scoring='r2')
print(scores)

What possible solutions exist? If a custom wrapper is needed, how should it be written to include a 'predict' method?

Answer 1

cross_val_score needs a pipeline whose last step is a model (something that has predict):

from sklearn.linear_model import BayesianRidge

scaler  = StandardScaler()
imputer = IterativeImputer()
model   = BayesianRidge()  # any model with a predict method

pipeline = Pipeline(steps=[('s', scaler), ('i', imputer), ('m', model)])

Running cross_val_score without a model makes no sense.
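If the goal really is to score the imputer itself as if it were a regressor, one possible wrapper is sketched below. This is only an assumption about how such a wrapper could look: the class name ImputerAsRegressor is hypothetical and not part of scikit-learn. The idea is to fit the imputer on X with y appended as an extra column, and at prediction time append an all-NaN column and return whatever the imputer fills in there.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import cross_val_score


class ImputerAsRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper: treats the target as a column to be imputed."""

    def __init__(self, imputer=None):
        self.imputer = imputer

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        # Fall back to a default IterativeImputer if none was passed in.
        base = IterativeImputer() if self.imputer is None else self.imputer
        self.imputer_ = clone(base)
        # The target becomes the last column seen by the imputer.
        self.imputer_.fit(np.hstack([X, y]))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Mark the whole target column as missing and let the imputer fill it.
        missing_y = np.full((X.shape[0], 1), np.nan)
        filled = self.imputer_.transform(np.hstack([X, missing_y]))
        return filled[:, -1]


# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

wrapper_scores = cross_val_score(ImputerAsRegressor(), X_demo, y_demo,
                                 cv=5, scoring='r2')
print(wrapper_scores)
```

With the default settings this mostly measures the imputer's internal estimator (BayesianRidge by default), so in many cases the plain pipeline with a model as the last step is the simpler choice.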

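I also see another problem: the values train, test that you pass to cross_val_score.

They should be named X, y rather than train, test. Those are only names, so that part is not very important; what matters is what you assign to them.

The problem is that X should not contain y, but you use train = df.values, so you created X with y included.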

df_train = pd.DataFrame({
                'X': range(20), 
                'y': range(20),
           })

X_train = df_train[ ['X'] ]  # it needs inner `[]` to create DataFrame, not Series
y_train = df_train[  'y'  ]  # it has to be single column (Series)

scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='r2')

(By the way: you don't have to use .values)

The same works with multiple columns:

df_train = pd.DataFrame({
                'A': range(20), 
                'B': range(20), 
                'y': range(20),
           })

X_train = df_train[ ['A', 'B'] ]
y_train = df_train[ 'y' ]
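Applied to the DataFrame from the question, where column 'A' is the target, the same split could be written with drop. This is a small sketch; the values in df are made up for illustration, since the question does not show its data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the question's df; 'A' is the target column.
df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0],
    'B': [0.5, np.nan, 1.5, 2.0],
    'C': [10.0, 20.0, np.nan, 40.0],
})

target_name = 'A'
X = df.drop(columns=[target_name])  # every column except the target
y = df[target_name]                 # the target as a single Series

print(list(X.columns), y.name)
```

This keeps the target out of X no matter how many feature columns the DataFrame has.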

Minimal working code, but with fake data (so the scores are meaningless):

# import libraries
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import BayesianRidge

df_train = pd.DataFrame({
                'A': range(100),  # fake data
                'B': range(100),  # fake data
                'y': range(100),  # fake data
           })

df_test = pd.DataFrame({
                'A': range(20),  # fake data
                'B': range(20),  # fake data
                'y': range(20),  # fake data
           })

# define scaler, model and pipeline
scaler  = StandardScaler()
imputer = IterativeImputer()
model   = BayesianRidge()

pipeline = Pipeline(steps=[('s', scaler), ('i', imputer), ('m', model)])

X_train = df_train[ ['A', 'B'] ]  # it needs inner `[]` to create DataFrame, not Series
y_train = df_train[ 'y' ]         # it has to be single column (Series)

scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='r2')
print(scores)

X_test = df_test[['A', 'B']]
y_test = df_test['y']

scores = cross_val_score(pipeline, X_test, y_test, cv=10, scoring='r2')
print(scores)
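The fake data above contains no missing values, so the imputer has nothing to fill in. As a rough sketch of the same pipeline on data that actually has gaps, the example below uses synthetic data with roughly 10% of the feature values set to NaN; the column names, coefficients and missing rate are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features and a target that depends on them linearly.
df_nan = pd.DataFrame({'A': rng.normal(size=n), 'B': rng.normal(size=n)})
df_nan['y'] = 2 * df_nan['A'] - df_nan['B'] + rng.normal(scale=0.1, size=n)

# Knock out roughly 10% of the feature values (the target stays complete).
missing_mask = rng.random((n, 2)) < 0.1
features = df_nan[['A', 'B']].mask(missing_mask)
target = df_nan['y']

nan_pipeline = Pipeline(steps=[
    ('s', StandardScaler()),    # ignores NaN when fitting, keeps NaN in its output
    ('i', IterativeImputer()),  # fills the gaps
    ('m', BayesianRidge()),     # final step provides predict
])

nan_scores = cross_val_score(nan_pipeline, features, target, cv=5, scoring='r2')
print(nan_scores)
```

Because the imputer sits inside the pipeline, it is refit on the training part of every fold, so no information from the held-out rows leaks into the imputation.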

Tags: python, machine-learning, missing-data, scikit-learn, imputation