How to load an already trained XGBoost model to run on a new dataset?

Question

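New to XGBoost, so please forgive me. I have trained a model on the Boston housing dataset and saved it locally. Now I want to load the model and use it to predict the labels of a new dataset with a similar structure. How would I do this in Python 3.6? So far I have this from the training step:

Update: tried pickle.

Update 2: added the cause of the error, the preprocessing.

Update 3: see the comments below for the answer.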

    print('Splitting the features and label columns...')
    X, y = data.iloc[:,:-1],data.iloc[:,-1]

    print('Converting dataset to Dmatrix structure to use later on...')
    data_dmatrix = xgb.DMatrix(data=X,label=y)
    #....
    # Some more stuff here.
    #....
    print('Now, train the model...')
    grid = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)

    # Now, save the model for later use on unseen data
    # (pickle.dump returns None, so there is nothing to assign)
    import pickle
    with open("pima.pickle.dat", "wb") as f:
        pickle.dump(grid, f)

    #.....after some time has passed

    # Now, load the model for use on a new dataset
    loaded_model = pickle.load(open("pima.pickle.dat", "rb"))
    print(loaded_model.feature_names)

    # Now, load a new dataset to run the model on and make predictions for
    dataset = pd.read_csv('Boston Housing Data.csv', skiprows=1)

    # Split the dataset into features and label
    # X = use all rows, up until the last column, which is the label or predicted column
    # y = use all rows in the last column of the dataframe ('Price')
    print('Splitting the new features and label column up for predictions...')
    X, y = dataset.iloc[:,:-1],dataset.iloc[:,-1]


    # Make predictions on labels of the test set
    preds = loaded_model.predict(X)

Now I get this traceback:

        preds = loaded_model.predict(X)
    AttributeError: 'DataFrame' object has no attribute 'feature_names'

Any ideas? I noticed that when I print loaded_model.feature_names, I get:

['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

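...but the actual .csv file has one extra column, 'PRICE', which was appended before training and used as the label during training. Does that mean anything?

I don't think I need to go through the whole train/test split, since I don't actually want to retrain the model. I just want to use it on this new dataset to make predictions and show the RMSE against the actual values in the new dataset. None of the tutorials I have seen online cover the step of applying the model to new data. Thoughts? Thanks!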

Answer 1

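You need to apply the same preprocessing to the training set and the test set to make any kind of prediction. Your problem is that you used the DMatrix structure during training (which, by the way, is required):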

    print('Converting dataset to Dmatrix structure to use later on...')
    data_dmatrix = xgb.DMatrix(data=X,label=y)

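but failed to apply that preprocessing to the test set. In other words, the loaded model's predict expects a DMatrix, and you passed it a plain DataFrame. Use the same preprocessing for all of the training, validation, and test sets, and your model will be golden.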

python

python-3.x

xgboost