Training loop for XGBoost on different datasets

I have built several different datasets, and I want to write a for loop that trains on each one; at the end I want an RMSE for each dataset. I tried a for loop, but it doesn't work: it returns the same value for every dataset, and I know the values should be different. The code I wrote is below:

for i in NEW_middle_index:
    DF = df1.iloc[i-100:i+100,:]
    # Collect each window's dataframe
    FINAL_DF.append(DF)
      
    y = DF.iloc[:,3]
    X = DF.drop(columns='Target')
    

    index_train = int(0.7 * len(X))


    X_train = X[:index_train]
    y_train = y[:index_train]

    X_test = X[index_train:]
    y_test = y[index_train:]
    
    scaler_x = MinMaxScaler().fit(X_train)
    X_train = scaler_x.transform(X_train)
    X_test  = scaler_x.transform(X_test)

xgb_r = xg.XGBRegressor(objective ='reg:linear',
                    n_estimators = 20, seed = 123)
for i in range(len(NEW_middle_index)):
#     print(i)
  
    # Fitting the model
    xgb_r.fit(X_train,y_train)

    # Predict the model
    pred = xgb_r.predict(X_test)

    # RMSE Computation
    rmse = np.sqrt(mean_squared_error(y_test,pred))
    # print(rmse)
    RMSE.append(rmse)

Not sure whether the indentation pasted correctly, but: you are overwriting `X_train` and `X_test` on every iteration of the first loop, so when you fit the model in the second loop it is always fitted on the same (last) dataset, which is why you get identical results.

One option is to fit the model inside the first loop, right after creating the train/test frames. Otherwise, if you want to keep all the train/test sets around, you can store them in a list of dictionaries without changing your code much, like this:

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import xgboost as xg

df1 = pd.DataFrame(np.random.normal(0,1,(600,3)))
df1['Target'] = np.random.uniform(0,1,600)

NEW_middle_index = [100,300,500]
NEWDF = []
for i in NEW_middle_index:
    
    y = df1.iloc[i-100:i+100, 3]
    X = df1.iloc[i-100:i+100, :].drop(columns='Target')
    
    index_train = int(0.7 * len(X))
    # Fit the scaler on the training portion only, to avoid leaking test data
    scaler_x = MinMaxScaler().fit(X[:index_train])

    X_train = scaler_x.transform(X[:index_train])
    y_train = y[:index_train]

    X_test = scaler_x.transform(X[index_train:])
    y_test = y[index_train:]
    
    NEWDF.append({'X_train':X_train,'y_train':y_train,'X_test':X_test,'y_test':y_test})

Then we fit and compute the RMSE:

RMSE = []
# 'reg:linear' is a deprecated alias; 'reg:squarederror' is the current name
xgb_r = xg.XGBRegressor(objective='reg:squarederror', n_estimators=20, seed=123)

for i in range(len(NEW_middle_index)):

    xgb_r.fit(NEWDF[i]['X_train'],NEWDF[i]['y_train'])
    pred = xgb_r.predict(NEWDF[i]['X_test'])
    rmse = np.sqrt(mean_squared_error(NEWDF[i]['y_test'],pred))
    RMSE.append(rmse)
    
RMSE
[0.3524827559800294, 0.3098101362502435, 0.3843173269966071]