Decision tree overfit test

I'm currently working with data that is prone to overfitting, so I wrote a function that tests the roc_auc score at each tree depth, because I read in the sklearn documentation that max_depth is usually what causes a tree to overfit. But I'm not sure whether my reasoning is correct. Here is a picture of my results:
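For reference, a minimal sketch of the kind of depth sweep described above; the dataset, the 0.25 test split, and the depth range 1-20 are assumptions, not my actual code:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for depth in range(1, 21):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Score on both sets: a growing gap between them is the usual sign of overfitting.
    auc_train = roc_auc_score(y_train, tree.predict_proba(X_train)[:, 1])
    auc_test = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
    print(f"depth={depth:2d}  train AUC={auc_train:.3f}  test AUC={auc_test:.3f}")
```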

I'm also experimenting with post-pruning, but my plot looks completely different from the other plots I've found online, so I'm not sure what it is actually telling me.
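This is roughly what I mean by post-pruning: a sketch using scikit-learn's cost_complexity_pruning_path, again with placeholder data rather than my real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Compute the sequence of effective alphas for minimal cost-complexity pruning.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit one pruned tree per alpha; larger alphas prune more aggressively.
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    pruned.fit(X_train, y_train)
    auc_test = roc_auc_score(y_test, pruned.predict_proba(X_test)[:, 1])
    print(f"ccp_alpha={alpha:.5f}  leaves={pruned.get_n_leaves()}  test AUC={auc_test:.3f}")
```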

The term you are looking for is cross-validation. The basic idea is simple: you split your dataset into a training set and a validation (or test) set. Then you train a model on the training set and evaluate it on the validation set. If your model is overfitted, it will perform well on the training set but poorly on the validation set. In that case it is best to decrease model complexity or add so-called regularization (e.g. tree pruning). Perhaps the simplest way to perform cross-validation in scikit-learn is to use the cross_val_score function, as described here.
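A minimal sketch of cross_val_score with a decision tree and ROC AUC scoring, matching the setup in the question; the data and the depths tried are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

for depth in (3, 5, None):  # None lets the tree grow fully, which tends to overfit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # 5-fold cross-validation; each fold is held out once as the validation set.
    scores = cross_val_score(tree, X, y, cv=5, scoring="roc_auc")
    print(f"max_depth={depth}  mean AUC={scores.mean():.3f} +/- {scores.std():.3f}")
```

If the cross-validated AUC drops as max_depth grows while the training AUC keeps climbing, that gap is the overfitting you are trying to detect.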

Note 1: In some cases (e.g. with neural networks) there is both a validation set and a test set (in addition to the training set). I won't go into the details here; just be careful not to mix up these terms across different contexts.

Note 2: Cross-validation is such a standard technique that it even gave its name to another StackExchange site - Cross Validated, where you may get more answers about statistics. Another and perhaps even more appropriate site has a self-explanatory name - Data Science.