Wrapper custom class for scikit-learn's IterativeImputer for use with cross_val_score()
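Scikit-learn's IterativeImputer can impute missing values in a round-robin fashion. To evaluate its performance against other conventional regressors, one can build a simple pipeline and get scoring metrics from cross_val_score. The problem is that, according to the error, IterativeImputer has no 'predict' method: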
AttributeError: 'IterativeImputer' object has no attribute 'predict'
Here is a minimal example that reproduces it:
# import libraries
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# define scaler, model and pipeline
scaler = StandardScaler() # use any scaler
imputer = IterativeImputer() # with any estimator, default = BayesianRidge()
pipeline = Pipeline(steps=[('s', scaler), ('i', imputer)])
train, test = df.values, df['A'].values
scores = cross_val_score(pipeline, train, test, cv=10, scoring='r2')
print(scores)
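What possible solutions exist? If a custom wrapper is needed, how should it be written to include a 'predict' method?
cross_val_score needs a pipeline with a model at the end (one that has predict):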
from sklearn.linear_model import BayesianRidge
scaler = StandardScaler()
imputer = IterativeImputer()
model = BayesianRidge() # any model with a predict method
pipeline = Pipeline(steps=[('s', scaler), ('i', imputer), ('m', model)])
cross_val_score without a model makes no sense.
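To make the point above concrete, here is a small sketch that checks which objects expose predict. The variable names (imputer_only, with_model, and the has_predict_* flags) are made up for illustration; only the classes themselves come from the question. A Pipeline delegates predict to its last step, so it should only offer predict when that last step does.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before importing IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline from the question: the last step is a transformer
imputer_only = Pipeline(steps=[('s', StandardScaler()), ('i', IterativeImputer())])

# Pipeline from the answer: the last step is a regressor
with_model = Pipeline(steps=[
    ('s', StandardScaler()),
    ('i', IterativeImputer()),
    ('m', BayesianRidge()),
])

# IterativeImputer is a transformer (fit / transform), so no predict is expected,
# which matches the AttributeError quoted in the question
has_predict_imputer = hasattr(IterativeImputer(), 'predict')
has_predict_imputer_only = hasattr(imputer_only, 'predict')
has_predict_with_model = hasattr(with_model, 'predict')

print('IterativeImputer:', has_predict_imputer)
print('pipeline ending in imputer:', has_predict_imputer_only)
print('pipeline ending in model:', has_predict_with_model)
```

Scoring with 'r2' compares predictions against y, which is why the final step has to be something that can predict.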
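I also see another problem: the values you pass to cross_val_score, train and test. They should be called X, y rather than train, test. Those are only names, so that part is not so important; what matters is what you assign to the variables.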
The problem is that X should not contain y, but you use train = df.values, so you create X with y included:
df_train = pd.DataFrame({
'X': range(20),
'y': range(20),
})
X_train = df_train[ ['X'] ] # it needs inner `[]` to create DataFrame, not Series
y_train = df_train[ 'y' ] # it has to be single column (Series)
scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='r2')
(BTW: you don't have to use .values)
The same works for multiple columns:
df_train = pd.DataFrame({
'A': range(20),
'B': range(20),
'y': range(20),
})
X_train = df_train[ ['A', 'B'] ]
y_train = df_train[ 'y' ]
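When there are many feature columns, listing them by hand gets tedious. One possible alternative, sketched here on the same fake frame, is to drop the target column so that X is "everything except y". This is just one way to do the split, not the only one.

```python
import pandas as pd

df_train = pd.DataFrame({
    'A': range(20),
    'B': range(20),
    'y': range(20),
})

# X is every column except the target, so y cannot leak into the features
X_train = df_train.drop(columns=['y'])
# y is a single column (Series)
y_train = df_train['y']

print(list(X_train.columns))
print(X_train.shape, y_train.shape)
```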
Minimal working code, but with fake data (so the scores are meaningless):
# import libraries
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import BayesianRidge
df_train = pd.DataFrame({
'A': range(100), # fake data
'B': range(100), # fake data
'y': range(100), # fake data
})
df_test = pd.DataFrame({
'A': range(20), # fake data
'B': range(20), # fake data
'y': range(20), # fake data
})
# define scaler, model and pipeline
scaler = StandardScaler()
imputer = IterativeImputer()
model = BayesianRidge()
pipeline = Pipeline(steps=[('s', scaler), ('i', imputer), ('m', model)])
X_train = df_train[ ['A', 'B'] ] # it needs inner `[]` to create DataFrame, not Series
y_train = df_train[ 'y' ] # it has to be single column (Series)
scores = cross_val_score(pipeline, X_train, y_train, cv=10, scoring='r2')
print(scores)
X_test = df_test[['A', 'B']]
y_test = df_test['y']
scores = cross_val_score(pipeline, X_test, y_test, cv=10, scoring='r2')
print(scores)
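The fake data above has no missing values, so the imputer has nothing to do. As a rough sketch of the same pipeline on data that actually contains NaN, the snippet below builds synthetic features and knocks out some entries at random. The data-generating formula, the 10% missing rate, the seed and cv=5 are all assumptions chosen for illustration, not anything from the question.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before importing IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic, correlated features and a target that depends on both (assumed for the demo)
a = rng.normal(size=n)
b = 0.5 * a + rng.normal(scale=0.1, size=n)
y = pd.Series(2.0 * a - b + rng.normal(scale=0.1, size=n), name='y')

X = pd.DataFrame({'A': a, 'B': b})
# Replace roughly 10% of the feature entries with NaN; y stays complete
missing = rng.random(X.shape) < 0.1
X = X.mask(missing)

pipeline = Pipeline(steps=[
    ('s', StandardScaler()),    # ignores NaN when fitting, keeps NaN when transforming
    ('i', IterativeImputer()),  # fills NaN, default estimator is BayesianRidge()
    ('m', BayesianRidge()),     # final step provides predict for scoring
])

# The scaler and imputer are refit on the training part of every fold
scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
print(scores)
print(scores.mean())
```

Because the imputer lives inside the pipeline, it is fitted only on each fold's training rows, so the held-out rows do not influence the imputation. To compare imputers, one could swap the 'i' step (for example for SimpleImputer) and keep the final model fixed.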