Do learning curves show overfitting?
I want to know whether my binary classification model is overfitting, so I generated learning curves. The dataset has 6836 instances, of which 1006 belong to the positive class.
1) If I balance the classes with SMOTE and use a RandomForest, I get the first curve, with TPR = 0.887 and FPR = 0.041.
Note that the training error is flat and almost zero.
2) If I balance the classes with the function "balanced_subsample" (attached at the end) and use a RandomForest, I get the second curve, with TPR = 0.866 and FPR = 0.14.
Note that in this case it is the test error that is flat.
- Is the model overfitting?
- Which of the two makes more sense?
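For context, a learning curve of this kind is usually produced with scikit-learn's learning_curve utility. The sketch below is illustrative only and assumes the balanced data is already in X_bal / y_bal; it is not necessarily the exact code behind the plots above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# X_bal, y_bal: the already balanced feature matrix and labels (assumed names)
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=100), X_bal, y_bal,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=10, scoring='accuracy')

# Plot error (1 - score) for the training folds and the cross-validation folds
plt.plot(train_sizes, 1 - train_scores.mean(axis=1), label='training error')
plt.plot(train_sizes, 1 - test_scores.mean(axis=1), label='test (CV) error')
plt.xlabel('number of training samples')
plt.ylabel('error')
plt.legend()
plt.show()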
The "balanced_subsample" function:
import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):
    # Collect the samples of each class and find the size of the smallest class
    class_xs = []
    min_elems = None
    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    # Number of samples to keep per class (optionally a fraction of the minority-class size)
    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    # Randomly undersample every class down to use_elems samples
    xs = []
    ys = []
    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)
        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)
        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)
    return xs, ys
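This is, in essence, random undersampling of every class down to the size of the smallest one. For reference, the imbalanced-learn library offers the same behaviour off the shelf; the snippet below is only an illustration of that equivalent, not something used in the question:
from imblearn.under_sampling import RandomUnderSampler  # illustrative alternative, not used in the question

# Undersample every class down to the size of the minority class
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X, y)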
EDIT1: More information about the code and the procedure
from time import time
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X = data
y = X.pop('myclass')
# There are categorical and numerical attributes in my data set, so here I vectorize the categorical attributes
arrX = vectorize_attributes(X)
# Here I balance the classes using either the SMOTE or the "balanced_subsample" approach
X_train_balanced, y_train_balanced = mySMOTEfunc(arrX, y)
#X_train_balanced, y_train_balanced = balanced_subsample(arrX, y)
# TRAIN/TEST SPLIT (the stratified k-fold is implicit in GridSearchCV below)
X_train, X_test, y_train, y_test = train_test_split(X_train_balanced, y_train_balanced, test_size=0.25)
# Estimator (note: np.random.seed() returns None, so this is effectively random_state=None)
clf = RandomForestClassifier(random_state=np.random.seed())
param_grid = {'n_estimators': [10, 50, 100, 200, 300], 'max_features': ['auto', 'sqrt', 'log2']}
# Grid search, scored with F1
score_func = metrics.f1_score
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10, scoring=metrics.make_scorer(score_func))
start = time()
CV_clf.fit(X_train, y_train)
# FIT & PREDICTION
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)
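mySMOTEfunc is not shown in the question; the assumption here is that it oversamples the minority class with SMOTE, along the lines of imbalanced-learn's implementation (a sketch, not the actual function):
from imblearn.over_sampling import SMOTE  # assumed implementation; the real mySMOTEfunc is not shown

def mySMOTEfunc(X, y):
    # Generate synthetic minority-class samples until both classes have the same size
    return SMOTE(random_state=0).fit_resample(X, y)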
EDIT2: In this case I tried a Gradient Boosting Classifier (GBC) in 3 scenarios: 1) GBC + SMOTE, 2) GBC + SMOTE + feature selection, and 3) GBC + SMOTE + feature selection + normalization.
from time import time
from sklearn import metrics, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, train_test_split

X = data
y = X.pop('myclass')
# There are categorical and numerical attributes in my data set, so here I vectorize the categorical attributes
arrX = vectorize_attributes(X)
# FOR SCENARIO 3: normalization
standardized_X = preprocessing.normalize(arrX)
# FOR SCENARIOS 2 and 3: removing all but the k highest-scoring features
arrX_features_selected = SelectKBest(chi2, k=5).fit_transform(standardized_X, y)
# Here I balance the classes using either the SMOTE or the "balanced_subsample" approach
X_train_balanced, y_train_balanced = mySMOTEfunc(arrX_features_selected, y)
#X_train_balanced, y_train_balanced = balanced_subsample(arrX_features_selected, y)
# TRAIN/TEST SPLIT (the stratified k-fold is implicit in GridSearchCV below)
X_train, X_test, y_train, y_test = train_test_split(X_train_balanced, y_train_balanced, test_size=0.25)
# Estimator (note: np.random.seed() returns None, so this is effectively random_state=None)
clf = RandomForestClassifier(random_state=np.random.seed())
param_grid = {'n_estimators': [10, 50, 100, 200, 300], 'max_features': ['auto', 'sqrt', 'log2']}
# Grid search, scored with F1
score_func = metrics.f1_score
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10, scoring=metrics.make_scorer(score_func))
start = time()
CV_clf.fit(X_train, y_train)
# FIT & PREDICTION
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)
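Note that the EDIT2 text refers to a Gradient Boosting Classifier while the snippet above still instantiates a RandomForestClassifier. Presumably the GBC runs simply swapped the estimator; the sketch below shows what that swap might look like (an assumption, with an illustrative grid, not the exact code used):
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Assumed GBC counterpart of the grid search above (illustrative grid; the one actually used is not shown)
clf_gbc = GradientBoostingClassifier(random_state=0)
param_grid_gbc = {'n_estimators': [100, 200, 300], 'learning_rate': [0.01, 0.1], 'max_depth': [3, 5]}
CV_gbc = GridSearchCV(estimator=clf_gbc, param_grid=param_grid_gbc, cv=10, scoring='f1')
CV_gbc.fit(X_train, y_train)
y_pred_gbc = CV_gbc.best_estimator_.predict(X_test)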
The learning curves for the 3 proposed scenarios are:
Scenario 1: GBC + SMOTE
Scenario 2: GBC + SMOTE + feature selection
Scenario 3: GBC + SMOTE + feature selection + normalization
So, your first curve makes sense. You expect the test error to drop as the number of training points grows. With a random forest that has no maximum depth and 100% max samples, you also expect consistently near-zero training error. You may well be overfitting, but you probably won't do better with random forests (or, depending on the dataset, with anything else).
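To make that concrete: with scikit-learn's defaults each tree is grown until its leaves are pure and is fit on a bootstrap sample of the full training-set size, so the forest essentially memorises the training data. A quick illustrative check (not from the original post, using the question's X_train/X_test split) would be:
from sklearn.ensemble import RandomForestClassifier

# Unconstrained forest: fully grown trees (max_depth=None) on full-size bootstrap samples
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
train_error = 1.0 - rf.score(X_train, y_train)  # expected to be close to 0
test_error = 1.0 - rf.score(X_test, y_test)     # the gap between the two is the overfitting signal
print(train_error, test_error)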
Your second curve doesn't make sense. You should again get near-zero training error, unless something completely bizarre happened (like a genuinely corrupted input set). I can't see anything wrong with your code, and I ran your function; it seems to work fine. Short of a complete working example with the code, there isn't much more I can do.
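For what it's worth, a quick sanity check of balanced_subsample on synthetic data (roughly what running the function amounts to; the data below is made up for illustration) behaves as expected:
import numpy as np

# Synthetic imbalanced data, just to exercise the function
X_demo = np.random.rand(100, 3)                 # 100 samples, 3 features
y_demo = np.array([0] * 90 + [1] * 10)          # 90 negatives vs 10 positives
X_sub, y_sub = balanced_subsample(X_demo, y_demo)
print(X_sub.shape, np.bincount(y_sub.astype(int)))  # (20, 3) and [10 10]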