XGBoost Regressor cannot fit the model using string data

Question

I am trying to use XGBoost to predict a DataFrame with a single target (one attribute). My code is below; I am running it in Colab.

!sudo pip install xgboost
!sudo pip install --upgrade xgboost
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
data = [['sp37n1sy1bmjc6yp3m7wqefpz' ], ['sp36vfqtjv87pvw68zdmhnvxb'], ['sp36y965ksqnmq0b0b58y1p00'], ['sp36y965ksqnmq0b0b58y1p00'], ['sp36y965ksqnmq0b0b58y1p00'], ['sp36y965ksqnmq0b0b58y1p00'], ['sp36vues2ed9r6s196dmv4p00'], ['sp36vvgq6rq9sq1gv0nt19h20'], ['sp36ypgx7jmmsuujz2ww81n20'], ['sp37n1w451m6wtp6h4eq0wjb0'], ['sp36y99s6w9jm3614ugt52bpz'], ['sp37n1mywgv57qsg5r7hp7bpz'], ['sp36y9fbfz4t9c5znp27z3pbp']]
df = pd.DataFrame(data)
X = data[:-1]
y = data[1:]
X_train, X_test, y_train, y_test = train_test_split(X, y)
regressor = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=3
)
regressor.fit(str(X_train), str(y_train))

But it returns the following error:

XGBoostError: [17:00:27] /workspace/dmlc-core/src/io/local_filesys.cc:86: LocalFileSystem.GetPathInfo: [['sp36ypgx7jmmsuujz2ww81n20'], ['sp36y965ksqnmq0b0b58y1p00'], ['sp37n1w451m6wtp6h4eq0wjb0'], ['sp36vvgq6rq9sq1gv0nt19h20'], ['sp36vfqtjv87pvw68zdmhnvxb'], ['sp37n1sy1bmjc6yp3m7wqefpz'], ['sp37n1mywgv57qsg5r7hp7bpz'], ['sp36vues2ed9r6s196dmv4p00'], ['sp36y99s6w9jm3614ugt52bpz']] error: File name too long
Stack trace:
  [bt] (0) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(dmlc::io::LocalFileSystem::GetPathInfo(dmlc::io::URI const&)+0x567) [0x7f6f13f157c7]
  [bt] (1) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(dmlc::io::InputSplitBase::InitInputFileInfo(std::string const&, bool)+0x14e) [0x7f6f13f044de]
  [bt] (2) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(dmlc::io::InputSplitBase::Init(dmlc::io::FileSystem*, char const*, unsigned long, bool)+0x43) [0x7f6f13f04be3]
  [bt] (3) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(dmlc::InputSplit::Create(char const*, char const*, unsigned int, unsigned int, char const*, bool, int, unsigned long, bool)+0xb7a) [0x7f6f13eed18a]
  [bt] (4) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(dmlc::InputSplit::Create(char const*, unsigned int, unsigned int, char const*)+0x1e) [0x7f6f13eed81e]
  [bt] (5) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(dmlc::Parser<unsigned int, float>* dmlc::data::CreateLibSVMParser<unsigned int, float>(std::string const&, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&, unsigned int, unsigned int)+0x1a) [0x7f6f13ecb09a]
  [bt] (6) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(dmlc::Parser<unsigned int, float>* dmlc::data::CreateParser_<unsigned int, float>(char const*, unsigned int, unsigned int, char const*)+0x15b) [0x7f6f13ebc23b]
  [bt] (7) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(xgboost::DMatrix::Load(std::string const&, bool, bool, std::string const&, unsigned long)+0x2df) [0x7f6f13c91a0f]
  [bt] (8) /usr/local/lib/python3.7/dist-packages/xgboost/./lib/libxgboost.so(XGDMatrixCreateFromFile+0xc2) [0x7f6f13c5f5b2]

If I change the last line to

regressor.fit(X_train, y_train)

I get this error:

TypeError: can not initialize DMatrix from list

What am I doing wrong? Any clues?

Answer 1

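XGBoost cannot fit on categorical variables given as raw strings, so they have to be encoded as numbers before being passed to the model. There are several ways to encode a categorical variable, depending on its nature. Since I believe your strings have some inherent order, label encoding is a reasonable fit for your data: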
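To make the idea concrete, here is a minimal sketch of what label encoding does, written with the standard library only. The helper names (fit_label_mapping, encode, decode) and the shortened strings are made up for illustration; they stand in for scikit-learn's LabelEncoder, which likewise assigns integers to the distinct values in sorted order.

```python
# Hypothetical helpers that mimic label encoding; not part of any library.

def fit_label_mapping(values):
    """Assign an integer to each distinct string, in sorted order."""
    classes = sorted(set(values))
    return {label: index for index, label in enumerate(classes)}

def encode(values, mapping):
    """Replace every string with its integer code."""
    return [mapping[v] for v in values]

def decode(codes, mapping):
    """Turn integer codes back into the original strings."""
    inverse = {index: label for label, index in mapping.items()}
    return [inverse[c] for c in codes]

# Shortened stand-ins for the strings in the question
strings = ["sp37n1sy", "sp36vfqt", "sp36y965", "sp36y965"]

mapping = fit_label_mapping(strings)
codes = encode(strings, mapping)

print(codes)                   # [2, 0, 1, 1]
print(decode(codes, mapping))  # the original strings again
```

Note that the integers only reflect the lexicographic order of the strings. Regressing on them makes sense only if that order carries meaning for your data.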

Full code:

import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
data = [['sp37n1sy1bmjc6yp3m7wqefpz' ], ['sp36vfqtjv87pvw68zdmhnvxb'], ['sp36y965ksqnmq0b0b58y1p00'], ['sp36y965ksqnmq0b0b58y1p00'], ['sp36y965ksqnmq0b0b58y1p00'], ['sp36y965ksqnmq0b0b58y1p00'], ['sp36vues2ed9r6s196dmv4p00'], ['sp36vvgq6rq9sq1gv0nt19h20'], ['sp36ypgx7jmmsuujz2ww81n20'], ['sp37n1w451m6wtp6h4eq0wjb0'], ['sp36y99s6w9jm3614ugt52bpz'], ['sp37n1mywgv57qsg5r7hp7bpz'], ['sp36y9fbfz4t9c5znp27z3pbp']]
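df = pd.DataFrame(data, columns=["code"])

# Fit a single encoder on every string so that X and y share one integer mapping
le = LabelEncoder()
codes = le.fit_transform(df["code"])

X = codes[:-1].reshape(-1, 1)  # features must be 2D: (n_samples, n_features)
y = codes[1:]                  # target is the value that follows each feature row

X_train, X_test, y_train, y_test = train_test_split(X, y)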

regressor = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=3
)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
# Round to the nearest label and clip to the range the encoder knows,
# otherwise inverse_transform raises an error on unseen labels
y_predictions = np.clip(np.rint(y_pred), 0, len(le.classes_) - 1).astype(int)
print("Encoded predictions:", y_predictions)
print("String predictions:", le.inverse_transform(y_predictions))
print()
print("Encoded actual values:", y_test)
print("String actual values:", le.inverse_transform(y_test))
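As a side note on the first error in the question: judging from the stack trace, which ends in XGDMatrixCreateFromFile and DMatrix::Load, the string passed to fit seems to have been treated as a file path. The sketch below (plain Python, with a two-row stand-in for X_train) shows why: calling str() on a list does not convert its elements, it produces one long string containing the printed form of the whole list.

```python
# Two-row stand-in for the X_train list from the question
X_train = [["sp36ypgx7jmmsuujz2ww81n20"], ["sp36y965ksqnmq0b0b58y1p00"]]

as_text = str(X_train)

print(type(X_train).__name__)  # list
print(type(as_text).__name__)  # str
print(as_text)                 # the whole list rendered as a single string
```

The second error ("can not initialize DMatrix from list") points the same way: the model expects array-like numeric input such as a NumPy array or a DataFrame, which the encoded X and y above provide.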

python

xgboost

google-colaboratory