How to train with TimeSeriesSplit from sklearn?
I have data like this (columns):
| year-month | client_id | Y | X1 .. Xn |
where Y indicates whether client client_id bought the product in the given year-month, and the X columns are the explanatory variables. I have two years of monthly data, and I have already built the splits correctly using the TimeSeriesSplit() approach given in the answer. The problem now is that I want to run a GridSearchCV() on that split, trying different models (RF, XGBoostClassifier(), LightGBM(), etc.) with different hyperparameters, but I cannot figure out a way to feed the split into GridSearchCV().
Any suggestions?
Assuming you have splits based on the df from the question:
First, save the indices of each fold into an array of (train, test) tuples, i.e.:
[(train_indices, test_indices), # 1st fold
 (train_indices, test_indices)] # 2nd fold, etc.
The following code will do that:
import numpy as np

custom_cv = []
for FOLD_train, FOLD_test in zip(splits['train'], splits['test']):
    # one (train_indices, test_indices) tuple per fold
    custom_cv.append((np.array(FOLD_train.index.values.tolist()),
                      np.array(FOLD_test.index.values.tolist())))
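If you still need to build the splits in the first place, here is a minimal sketch using sklearn's TimeSeriesSplit directly; it assumes df is already sorted chronologically by year-month and has a default RangeIndex, so the positional indices it yields line up with the row labels used above:

from sklearn.model_selection import TimeSeriesSplit

# Hypothetical setup: df sorted by 'year-month', default 0..n-1 index
tscv = TimeSeriesSplit(n_splits=5)   # 5 expanding-window folds
custom_cv = list(tscv.split(df))     # [(train_indices, test_indices), ...]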
You can then use GridSearchCV() as follows.
Here we create one dictionary with the classifiers and another with their parameter grids. This is just an example; be sure to restrict the search space when you try it out.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier  # classifier, since Y is a purchase indicator

dict_classifiers = {
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "Linear SVM": SVC(),
    "XGB": XGBClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
}
params = {
    "Random Forest": {"max_depth": range(5, 30, 5),
                      "min_samples_leaf": range(1, 30, 2),
                      "n_estimators": range(100, 2000, 200)},
    "Gradient Boosting Classifier": {"learning_rate": [0.001, 0.01, 0.1],
                                     "n_estimators": range(1000, 3000, 200)},
    "Linear SVM": {"kernel": ["rbf", "poly"], "gamma": ["auto", "scale"],
                   "degree": range(1, 6, 1)},
    "XGB": {"min_child_weight": [1, 5, 10],
            "gamma": [0.5, 1, 1.5, 2, 5],
            "subsample": [0.6, 0.8, 1.0],
            "colsample_bytree": [0.6, 0.8, 1.0],
            "max_depth": [3, 4, 5],
            "n_estimators": [300, 600],
            "learning_rate": [0.001, 0.01, 0.1]},
    "Logistic Regression": {"penalty": ["none", "l2"],  # use None instead of "none" on scikit-learn >= 1.2
                            "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
    "Nearest Neighbors": {"n_neighbors": [3, 5, 11, 19],
                          "weights": ["uniform", "distance"],
                          "metric": ["euclidean", "manhattan"]},
    "Decision Tree": {"criterion": ["gini", "entropy"],
                      "max_depth": np.arange(3, 15)},
}
# run a grid search for every classifier that has a parameter grid defined
for classifier_name in dict_classifiers.keys() & params:
    print("training: ", classifier_name)
    gridSearch = GridSearchCV(
        estimator=dict_classifiers[classifier_name],
        param_grid=params[classifier_name],
        cv=custom_cv)  # the list of (train, test) index tuples built above
    gridSearch.fit(df[['X']].to_numpy(),               # should have shape (n_samples, n_features)
                   df[['Y']].to_numpy().reshape(-1))   # should be an array of shape (n_samples,)
    print(gridSearch.best_score_, gridSearch.best_params_)
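The question also mentions LightGBM(); assuming the lightgbm package is installed, its sklearn wrapper can be added to the same two dictionaries (before the loop runs) and will then be included in the search, for example:

from lightgbm import LGBMClassifier

dict_classifiers["LightGBM"] = LGBMClassifier()
params["LightGBM"] = {"num_leaves": [15, 31, 63],  # illustrative grid, not tuned values
                      "learning_rate": [0.01, 0.1],
                      "n_estimators": [100, 500]}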
Replace ['X'] in gridSearch.fit with df.columns[pd.Series(df.columns).str.startswith('X')] if you want to pass all columns whose names start with 'X' (e.g. 'X1', 'X2', ...) as the training features.
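Concretely, that substitution looks like this (a small sketch; it assumes pandas is imported as pd and the feature columns are literally named 'X1' .. 'Xn'):

import pandas as pd

feature_cols = df.columns[pd.Series(df.columns).str.startswith('X')]
gridSearch.fit(df[feature_cols].to_numpy(),       # (n_samples, n_features)
               df[['Y']].to_numpy().reshape(-1))  # (n_samples,)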