Trouble training xgboost on categorical column

I am trying to run a Python notebook (link). At the line below, In [446]:, where the author trains XGBoost, I am getting an error

ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields StateHoliday, Assortment

# XGB with xgboost library
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, 300, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)

Here is minimal code for testing

import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

with open('train_store', 'rb') as f:
    train_store = pickle.load(f)

train_store.shape

predictors = ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Day', 
              'WeekOfYear', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 
              'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'CompetitionOpen', 
              'PromoOpen']

y = np.log(train_store.Sales) # log transformation of Sales
X = train_store

# split the data into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3, # 30% for the evaluation set
                                                    random_state = 42)

# base parameters
params = {
    'booster': 'gbtree', 
    'objective': 'reg:linear', # regression task
    'subsample': 0.8,          # 80% of data to grow trees and prevent overfitting
    'colsample_bytree': 0.85,  # 85% of features used
    'eta': 0.1, 
    'max_depth': 10, 
    'seed': 42} # for reproducible results

num_round = 60 # default 300

dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest  = xgb.DMatrix(X_test[predictors],  y_test)

watchlist = [(dtrain, 'train'), (dtest, 'test')]

xgb_model = xgb.train(params, dtrain, num_round, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)

Link to the train_store data file: Link 1

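As the error message says, xgboost is unhappy because you are feeding it data types it does not accept. It is telling you that it cannot handle categorical or datetime features. Check the types of the StateHoliday and Assortment features and encode them as numbers in some way (for example one-hot encoding, label encoding (which works for tree-based models), or target encoding).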
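The first two encodings mentioned above can be sketched with pandas alone. The toy frame below is an assumption standing in for train_store: only the column names come from the question, and the values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for train_store; values are made up for illustration.
df = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "StateHoliday": ["0", "a", "0", "b"],
    "Assortment": ["a", "c", "a", "b"],
})

# Label encoding: map each distinct value to an integer code
# (codes follow the sorted order of the distinct values).
label_encoded = df.copy()
for col in ["StateHoliday", "Assortment"]:
    label_encoded[col] = label_encoded[col].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per distinct value.
one_hot = pd.get_dummies(df, columns=["StateHoliday", "Assortment"], dtype=int)

print(label_encoded.dtypes)
print(one_hot.columns.tolist())
```

Either result contains only numeric columns, which is what the error message asks for. Label encoding keeps the column count unchanged, while one-hot encoding adds one column per category.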

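I ran into exactly the same problem while working on the Rossmann sales forecasting project. Newer versions of xgboost do not seem to accept the data types of StateHoliday, Assortment, and StoreType.

As Mykhailo Lisovyi suggested, you can check the data types with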
print(test_train.dtypes)

Replace test_train here with your X_train.

You will probably get

DayOfWeek                      int64
Promo                          int64
StateHoliday                   int64
SchoolHoliday                  int64
StoreType                     object
Assortment                    object
CompetitionDistance          float64
CompetitionOpenSinceMonth    float64
CompetitionOpenSinceYear     float64
Promo2                         int64
Promo2SinceWeek              float64
Promo2SinceYear              float64
Year                           int64
Month                          int64
Day                            int64

The object-typed columns are what trigger the error. You can convert them with
from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
test_train['StoreType'] = lbl.fit_transform(test_train['StoreType'].astype(str))
test_train['Assortment'] = lbl.fit_transform(test_train['Assortment'].astype(str))

After these steps, training should run without the dtype error.
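If more than a couple of columns are affected, the same idea can be applied to every string column in one pass. This is only a sketch: encode_object_columns is a hypothetical helper name, the demo frame is made up, and it uses pandas' factorize instead of LabelEncoder so that the mapping for each column is kept.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def encode_object_columns(frame):
    """Return a copy of `frame` with string/object columns replaced by integer codes.

    Also returns a dict mapping each encoded column to its list of original
    values, where the position in the list is the integer code.
    """
    out = frame.copy()
    mappings = {}
    for col in out.columns:
        if is_object_dtype(out[col]) or is_string_dtype(out[col]):
            # sort=True makes the codes follow the sorted distinct values.
            codes, uniques = pd.factorize(out[col].astype(str), sort=True)
            out[col] = codes
            mappings[col] = list(uniques)
    return out, mappings


# Made-up demo data; only the column names come from the thread.
demo = pd.DataFrame({
    "StoreType": ["a", "c", "a", "d"],
    "Assortment": ["a", "c", "b", "a"],
    "Promo": [1, 0, 1, 0],
})
encoded, mappings = encode_object_columns(demo)
print(encoded.dtypes)
print(mappings)
```

To keep the codes consistent between the training and test sets, it is usually safer to encode before splitting, or to reuse the stored mappings on the test set.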

Try this

import pandas as pd

train_store['StateHoliday'] = pd.to_numeric(train_store['StateHoliday'])
train_store['Assortment'] = pd.to_numeric(train_store['Assortment'])
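One caveat worth checking before relying on this: pd.to_numeric only parses strings that look like numbers. The sketch below uses made-up values, and whether the real columns contain letter codes is an assumption about the dataset. It shows the default behaviour and the errors="coerce" option.

```python
import pandas as pd

numeric_strings = pd.Series(["0", "1", "0"])
letter_codes = pd.Series(["0", "a", "b"])  # made-up values for illustration

# Strings that look like numbers convert cleanly.
converted = pd.to_numeric(numeric_strings)
print(converted.tolist())

# With the default errors="raise", unparseable strings raise ValueError.
try:
    pd.to_numeric(letter_codes)
    raised = False
except ValueError:
    raised = True
print("raised ValueError:", raised)

# errors="coerce" turns unparseable entries into NaN instead of raising,
# which silently discards the category information.
coerced = pd.to_numeric(letter_codes, errors="coerce")
print(coerced.isna().tolist())
```

If the conversion raises, label or one-hot encoding is probably the better fit for that column.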

The version of XGBoost in the H2O package can handle categorical variables (though not too many!), but XGBoost as its own package does not seem to.

I tried it with a pandas DataFrame, but xgboost did not like it

categoricals = ['StoreType']  # etc.
pdf[categoricals] = pdf[categoricals].astype('category')

To use categorical columns with H2O, you must first convert the strings to factors:

h2odf[categoricals] = h2odf[categoricals].asfactor()

Also note that h2o has its own data frame type, which is different from a pandas DataFrame.