XGBoost error - When categorical type is supplied, DMatrix parameter `enable_categorical` must be set to `True`
I have four categorical features and a fifth, numerical one (Var5). When I try the following code:
cat_attribs = ['var1','var2','var3','var4']
full_pipeline = ColumnTransformer([('cat', OneHotEncoder(handle_unknown = 'ignore'), cat_attribs)], remainder = 'passthrough')
X_train = full_pipeline.fit_transform(X_train)
model = XGBRegressor(n_estimators=10, max_depth=20, verbosity=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
When the model tries to make its predictions, I get this error message:
ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When categorical type is supplied, DMatrix parameter `enable_categorical` must be set to `True`.Var1, Var2, Var3, Var4
Does anyone know what is going wrong here?
If it helps, here is a small sample of the X_train data and the y_train data:
Var1 Var2 Var3 Var4 Var5
1507856 JP 2009 6581 OME 325.787218
839624 FR 2018 5783 I_S 11.956326
1395729 BE 2015 6719 OME 42.888565
1971169 DK 2011 3506 RPP 70.094146
1140120 AT 2019 5474 NMM 270.082738
and:
Ind_Var
1507856 8.013558
839624 4.105559
1395729 7.830077
1971169 83.000000
1140120 51.710526
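The problem with your code is that you encoded the categorical features in X_train but not in X_test, so model.predict(X_test) receives a raw DataFrame that still contains non-numeric columns and raises the error. To fix this, first fit the encoder on X_train only, then use that fitted encoder to transform both X_train and X_test. See the code below for an example.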
import pandas as pd
from xgboost import XGBRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# define the input data
df = pd.DataFrame([
{'Var1': 'JP', 'Var2': 2009, 'Var3': 6581, 'Var4': 'OME', 'Var5': 325.787218, 'Ind_Var': 8.013558},
{'Var1': 'FR', 'Var2': 2018, 'Var3': 5783, 'Var4': 'I_S', 'Var5': 11.956326, 'Ind_Var': 4.105559},
{'Var1': 'BE', 'Var2': 2015, 'Var3': 6719, 'Var4': 'OME', 'Var5': 42.888565, 'Ind_Var': 7.830077},
{'Var1': 'DK', 'Var2': 2011, 'Var3': 3506, 'Var4': 'RPP', 'Var5': 70.094146, 'Ind_Var': 83.000000},
{'Var1': 'AT', 'Var2': 2019, 'Var3': 5474, 'Var4': 'NMM', 'Var5': 270.082738, 'Ind_Var': 51.710526}
])
# extract the features and target
X_train, y_train = df.iloc[:3, :-1], df.iloc[:3, -1]
X_test, y_test = df.iloc[3:, :-1], df.iloc[3:, -1]
# one-hot encode the categorical features
cat_attribs = ['Var1', 'Var2', 'Var3', 'Var4']
full_pipeline = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_attribs)], remainder='passthrough')
encoder = full_pipeline.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)
# train the model
model = XGBRegressor(n_estimators=10, max_depth=20, verbosity=2)
model.fit(X_train, y_train)
# extract the training set predictions
model.predict(X_train)
# array([7.0887003, 3.7923286, 7.0887003], dtype=float32)
# extract the test set predictions
model.predict(X_test)
# array([7.0887003, 7.0887003], dtype=float32)