Trouble training xgboost on categorical column
I am trying to run a Python notebook (link). At the line below, In [446]:, where the author trains XGBoost, I am getting an error:
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in fields StateHoliday, Assortment
# XGB with xgboost library
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)
watchlist = [(dtrain, 'train'), (dtest, 'test')]
xgb_model = xgb.train(params, dtrain, 300, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)
Here is the minimal code I use for testing:
import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# load the pickled DataFrame
with open('train_store', 'rb') as f:
    train_store = pickle.load(f)
train_store.shape

predictors = ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Day',
              'WeekOfYear', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth',
              'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'CompetitionOpen',
              'PromoOpen']
y = np.log(train_store.Sales) # log transformation of Sales
X = train_store

# split the data into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.3, # 30% for the evaluation set
                                                    random_state = 42)

# base parameters
params = {
    'booster': 'gbtree',
    'objective': 'reg:linear', # regression task
    'subsample': 0.8, # 80% of data to grow trees and prevent overfitting
    'colsample_bytree': 0.85, # 85% of features used
    'eta': 0.1,
    'max_depth': 10,
    'seed': 42} # for reproducible results
num_round = 60 # default 300

# the ValueError is raised here, when the DMatrix is built
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)
watchlist = [(dtrain, 'train'), (dtest, 'test')]

# rmspe_xg is the custom evaluation metric defined in the notebook
xgb_model = xgb.train(params, dtrain, num_round, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)
Link to the train_store data file: Link 1
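The minimal code calls rmspe_xg without defining it. For anyone trying to reproduce the problem, here is a rough sketch of what such a metric might look like. It assumes the usual root mean square percentage error and assumes the labels were log-transformed as in the snippet, so both sides are exponentiated first. The notebook's own definition may differ.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean square percentage error (y_true must not contain zeros)."""
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)))

def rmspe_xg(y_pred, dmatrix):
    """Custom metric in the (predictions, DMatrix) -> (name, value) form.

    Assumption: labels and predictions are on the log scale, so they are
    mapped back with np.exp before the error is computed.
    """
    y_true = np.exp(dmatrix.get_label())
    y_pred = np.exp(y_pred)
    return "rmspe", rmspe(y_true, y_pred)
```

The function only relies on the get_label() method of the object it receives, so it can be checked without training a model.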
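As the error message says, xgboost is unhappy because you are feeding it a type it does not know. It only accepts int, float or bool columns, so it cannot handle categorical or datetime features directly. Check the types of the StateHoliday and Assortment features and encode them as numbers in some way (for example one-hot encoding, label encoding (fine for tree-based models), or target encoding).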
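To make two of those options concrete, here is a small sketch using pandas only. The toy frame and its values are made up for illustration; in the question it would be X_train[predictors].

```python
import pandas as pd

# Hypothetical toy frame standing in for X_train[predictors].
df = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "StateHoliday": ["0", "a", "0", "b"],
    "Assortment": ["a", "c", "b", "a"],
})
categorical_cols = ["StateHoliday", "Assortment"]

# One-hot encoding: one 0/1 indicator column per category value.
one_hot = pd.get_dummies(df, columns=categorical_cols, dtype=int)

# Label encoding: replace each category with an integer code.
label = df.copy()
for col in categorical_cols:
    label[col] = label[col].astype("category").cat.codes

print(one_hot.dtypes)
print(label.dtypes)
```

Both results contain only integer columns, which is what the DMatrix constructor asks for. One-hot encoding adds a column per category value, while label encoding keeps the column count unchanged but imposes an arbitrary order on the categories.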
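I ran into exactly the same problem while working on the Rossmann Sales Prediction project.
Newer versions of xgboost do not seem to accept the data types of StateHoliday, Assortment, and StoreType.
As Mykhailo Lisovyi suggested, you can check the data types with
print(test_train.dtypes)
Replace test_train here with your X_train.
You will probably get something like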
DayOfWeek int64
Promo int64
StateHoliday int64
SchoolHoliday int64
StoreType object
Assortment object
CompetitionDistance float64
CompetitionOpenSinceMonth float64
CompetitionOpenSinceYear float64
Promo2 int64
Promo2SinceWeek float64
Promo2SinceYear float64
Year int64
Month int64
Day int64
The columns of type object are what raise the error. You can convert them with
from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
test_train['StoreType'] = lbl.fit_transform(test_train['StoreType'].astype(str))
test_train['Assortment'] = lbl.fit_transform(test_train['Assortment'].astype(str))
After these steps the DMatrix should build without the error.
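One thing to watch for: if the training and test frames are encoded separately, the same letter can end up with different codes in each. A possible way around that, sketched below with pandas only, is to take the category list from the training frame and reuse it for the test frame. The frames and values are invented for the example.

```python
import pandas as pd

# Hypothetical stand-ins for the real train/test frames.
X_train = pd.DataFrame({"StoreType": ["a", "c", "b"], "Promo": [1, 0, 1]})
X_test = pd.DataFrame({"StoreType": ["c", "a", "d"], "Promo": [0, 0, 1]})

# Every column that is neither numeric nor boolean needs encoding.
to_encode = X_train.select_dtypes(exclude=["number", "bool"]).columns.tolist()

for col in to_encode:
    # Learn the categories from the training data only.
    categories = sorted(X_train[col].astype(str).unique())
    # Apply the same mapping to both frames; unseen values become -1.
    X_train[col] = pd.Categorical(X_train[col].astype(str), categories=categories).codes
    X_test[col] = pd.Categorical(X_test[col].astype(str), categories=categories).codes

print(X_train.dtypes)
print(X_test)
```

Here the value "d" appears only in the test frame, so it is encoded as -1 rather than silently reusing another category's code.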
Try this:
train_store['StateHoliday'] = pd.to_numeric(train_store['StateHoliday'])
train_store['Assortment'] = pd.to_numeric(train_store['Assortment'])
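A caveat on pd.to_numeric: it only succeeds when every value can be parsed as a number, and it raises a ValueError otherwise. If the columns hold letter codes, an explicit mapping may be needed instead. The sketch below uses a made-up column that mixes the integer 0, the string "0" and letters, and the mapping is an assumption chosen for the example.

```python
import pandas as pd

# Hypothetical sample: a column mixing 0, "0" and letter codes.
state_holiday = pd.Series([0, "0", "a", "b", "c", "0"], name="StateHoliday")

def to_numeric_or_none(series):
    """Return the numeric series, or None if some value is not a number."""
    try:
        return pd.to_numeric(series)
    except (ValueError, TypeError):
        return None

# Fails here because "a", "b" and "c" cannot be parsed as numbers.
direct = to_numeric_or_none(state_holiday)

# Normalise everything to strings first, then map each value to an integer.
mapping = {"0": 0, "a": 1, "b": 2, "c": 3}  # assumed codes, pick your own
encoded = state_holiday.astype(str).map(mapping)

print(direct)
print(encoded.tolist())
```

Casting to str before mapping also merges the integer 0 and the string "0" into a single category.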
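The version of XGBoost in the H2O package can handle categorical variables (though not too many!), but XGBoost as its own package apparently cannot.
I tried it with a pandas DataFrame, but xgboost did not like it:
categoricals = ['StoreType', ]  # etc.
pdf[categoricals] = pdf[categoricals].astype('category')
To use H2O with categoricals, you first have to convert the strings to factors:
h2odf[categoricals] = h2odf[categoricals].asfactor()
Also note that h2o has its own data frame, which is different from a pandas one.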
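For what it is worth, recent releases of the standalone xgboost package can, as far as I know, accept pandas category columns if the DMatrix is created with enable_categorical=True. The sketch below is written under that assumption. The helper names to_category and make_dmatrix are made up for this example, and older xgboost versions will still reject category columns.

```python
import pandas as pd

def to_category(df, columns):
    """Return a copy of df with the given columns cast to the category dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out

def make_dmatrix(df, label):
    """Build a DMatrix that keeps category columns as categoricals.

    Assumption: the installed xgboost is recent enough to support
    enable_categorical=True.
    """
    import xgboost as xgb  # imported lazily so the pandas part runs without xgboost
    return xgb.DMatrix(df, label=label, enable_categorical=True)

# Hypothetical toy frame.
pdf = pd.DataFrame({"StoreType": ["a", "b", "a"], "Promo": [1, 0, 1]})
pdf = to_category(pdf, ["StoreType"])
print(pdf.dtypes)
```

With the frames from the question this would be called as make_dmatrix(X_train[predictors], y_train), probably together with 'tree_method': 'hist' in params. It is worth checking the documentation for the installed version before relying on it.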