Use Scikit-Learn's GridSearchCV to capture precision, recall, and f1 for all permutations?
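I want to use Scikit-Learn's GridSearchCV to run a bunch of experiments and then print the recall, precision, and f1 for each experiment.
This article (https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html) suggests that I need to run .fit and .predict multiple times: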
    ...
    scores = ['precision', 'recall']
    ...
    for score in scores:
        ...
        clf = GridSearchCV(
            SVC(), tuned_parameters, scoring='%s_macro' % score
        )
        clf.fit(X_train, y_train)  # running for each scoring metric
        ...
        for mean, std, params in zip(means, stds, clf.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean, std * 2, params))
        ...
        y_true, y_pred = y_test, clf.predict(X_test)  # running for each scoring metric
        print(classification_report(y_true, y_pred))
I would like to run .fit only once and record all the recall, precision, and f1 metrics. For example, something like this:
    clf = GridSearchCV(
        SVC(), tuned_parameters, scoring=['recall', 'precision', 'f1']  # I don't think this syntax is even possible
    )
    clf.fit(X_train, y_train)
    for metric in clf.something_that_i_cannot_find:
        ### does something like this exist?
        print(metric['precision'])
        print(metric['recall'])
        print(metric['f1'])
        ###:end does something like this exist?
Or even:
    ...
    for run in clf.something_that_i_cannot_find:
        ### does something like this exist?
        print(classification_report(run.y_true, run.y_pred))
        ###:end does something like this exist?
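Another article suggests that GridSearchCV can be made aware of multiple scorers, but I still cannot figure out how to access each score for all of the experiments.
Does GridSearchCV not support what I am looking for? Is the approach used in the article (i.e. running .fit and .predict multiple times) the simplest way to accomplish something like what I am asking for?
Thanks for your time.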
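You will have to do this manually, which takes a fair amount of code: several loops with sklearn, one over the folds and another over the parameters. I suggest setting a random state for the fold strategy, the grid search, and the model, and running the grid search 3 times, once for each metric.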
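The per-metric approach described above can be sketched roughly as follows. This is only an illustration under assumptions: the synthetic data from `make_classification`, the small parameter grid, and the `per_metric` dictionary are all made up for the example. Fixing the seed of the fold strategy means each of the three searches sees the same splits, so the scores line up row by row.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic binary data, used only for illustration (an assumption).
X, y = make_classification(n_samples=200, n_classes=2, random_state=0)

# A seeded fold strategy gives every search identical train/test splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}  # hypothetical grid

# One grid search per metric; keep each search's cv_results_.
per_metric = {}
for metric in ["precision", "recall", "f1"]:
    search = GridSearchCV(SVC(random_state=0), param_grid, scoring=metric, cv=cv)
    search.fit(X, y)
    per_metric[metric] = search.cv_results_

# The candidate order is the same in every search, so index i refers to
# the same parameter combination in all three result sets.
params = per_metric["f1"]["params"]
for i, candidate in enumerate(params):
    row = {m: round(float(per_metric[m]["mean_test_score"][i]), 3) for m in per_metric}
    print(candidate, row)
```

The cost is that every candidate is fitted three times, once per metric.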
You can do multi-metric evaluation for binary classification. When I tried to implement it on the iris dataset, I ran into ValueError: Multi-class not supported.
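One likely cause of that error (an assumption, since the failing code is not shown) is that the plain 'precision', 'recall', 'f1', and 'roc_auc' scorers are defined for binary targets. A minimal sketch of the same idea on iris, using the macro-averaged scorer names instead and leaving AUC out:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # three classes

# Macro-averaged variants of the scorers accept multi-class targets.
scoring = ["precision_macro", "recall_macro", "f1_macro"]

search = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["linear"], "C": [1, 10]},  # hypothetical grid
    scoring=scoring,
    refit="f1_macro",  # with several metrics, refit must name one of them
)
search.fit(X, y)

# One array per metric, with one entry per parameter combination.
for name in scoring:
    print(name, search.cv_results_["mean_test_%s" % name])
```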
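I have implemented it on basic binary data below, where I compute four different scores: ['AUC', 'F1', 'Precision', 'Recall'].
Note: the point is not to draw conclusions from the model, only to show how multi-metric evaluation works. The data is just random data.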
    from sklearn import datasets
    from sklearn.metrics import f1_score, make_scorer
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = datasets.make_classification(n_classes=2, random_state=0)
    # The scorers can be either one of the predefined metric strings or a scorer
    # callable, like the one returned by make_scorer
    f1_scorer = make_scorer(f1_score, average='binary')
    # The keys become the suffixes of the cv_results_ entries, e.g. 'mean_test_F1'
    scoring = {'AUC': 'roc_auc', 'F1': f1_scorer, 'Precision': 'precision', 'Recall': 'recall'}
    # split data to train and test data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = GridSearchCV(
        SVC(),
        param_grid={'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
        scoring=scoring,
        refit='AUC',  # metric used to pick best_estimator_
        return_train_score=True
    )
    clf.fit(X_train, y_train)
    results = clf.cv_results_
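To print every metric for every parameter combination after a single fit, the `cv_results_` dictionary can be read directly. The following is a minimal self-contained sketch under assumptions: it rebuilds a similar search on synthetic data, and the `rows` list is only an illustrative helper. Each scorer named `NAME` in the `scoring` dictionary yields entries such as `mean_test_NAME`, `std_test_NAME`, and `rank_test_NAME`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary data, used only for illustration (an assumption).
X, y = make_classification(n_classes=2, random_state=0)

scoring = {"AUC": "roc_auc", "F1": "f1", "Precision": "precision", "Recall": "recall"}
search = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["linear"], "C": [1, 10, 100]},
    scoring=scoring,
    refit="AUC",
)
search.fit(X, y)  # a single fit evaluates all four metrics
results = search.cv_results_

# Collect one row per parameter combination: (mean, std) for each metric.
rows = []
for i, params in enumerate(results["params"]):
    row = {"params": params}
    for name in scoring:
        mean = float(results["mean_test_%s" % name][i])
        std = float(results["std_test_%s" % name][i])
        row[name] = (mean, std)
    rows.append(row)

for row in rows:
    print(row["params"])
    for name in scoring:
        mean, std = row[name]
        print("  %-9s %0.3f (+/-%0.03f)" % (name, mean, std * 2))
```

Per-fold values are available under keys such as `split0_test_F1` if the averages are not enough.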
**Plotting the result**
    import matplotlib.pyplot as plt
    import numpy as np

    plt.figure(figsize=(10, 10))
    plt.title("GridSearchCV evaluating using multiple scorers simultaneously",
              fontsize=16)
    plt.xlabel("C")  # the x axis holds the values of the C parameter
    plt.ylabel("Score")
    ax = plt.gca()
    ax.set_xlim(1, 1000)
    ax.set_ylim(0.40, 1)
    # Get the regular numpy array from the MaskedArray
    X_axis = np.array(results['param_C'].data, dtype=float)
    for scorer, color in zip(sorted(scoring), ['g', 'k', 'b', 'r']):
        for sample, style in (('train', '--'), ('test', '-')):
            sample_score_mean = results['mean_%s_%s' % (sample, scorer)]
            sample_score_std = results['std_%s_%s' % (sample, scorer)]
            ax.fill_between(X_axis, sample_score_mean - sample_score_std,
                            sample_score_mean + sample_score_std,
                            alpha=0.1 if sample == 'test' else 0, color=color)
            ax.plot(X_axis, sample_score_mean, style, color=color,
                    alpha=1 if sample == 'test' else 0.7,
                    label="%s (%s)" % (scorer, sample))
        best_index = np.nonzero(results['rank_test_%s' % scorer] == 1)[0][0]
        best_score = results['mean_test_%s' % scorer][best_index]
        # Plot a dotted vertical line at the best score for that scorer marked by x
        ax.plot([X_axis[best_index], ] * 2, [0, best_score],
                linestyle='-.', color=color, marker='x', markeredgewidth=3, ms=8)
        # Annotate the best score for that scorer
        ax.annotate("%0.2f" % best_score,
                    (X_axis[best_index], best_score + 0.005))
    plt.legend(loc="best")
    plt.grid(False)
    plt.show()
Output plot
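For the second form asked about in the question, a full `classification_report` per parameter combination, `cv_results_` does not keep the predictions. One possible workaround, sketched here under assumptions (synthetic data, a hypothetical grid, and a `reports` dictionary made up for the example), is to loop over `ParameterGrid` and collect out-of-fold predictions with `cross_val_predict`:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import ParameterGrid, StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

# Synthetic binary data, used only for illustration (an assumption).
X, y = make_classification(n_samples=200, n_classes=2, random_state=0)

# Seeded folds so every parameter combination is evaluated on the same splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = ParameterGrid({"kernel": ["linear", "rbf"], "C": [1, 10]})  # hypothetical grid

reports = {}
for params in grid:
    # Each sample is predicted by a model that did not see it during training.
    y_pred = cross_val_predict(SVC(**params), X, y, cv=cv)
    reports[str(params)] = classification_report(y, y_pred, output_dict=True)
    print(params)
    print(classification_report(y, y_pred))
```

Because the report is computed on predictions pooled across folds, its numbers can differ slightly from the fold-averaged scores that GridSearchCV stores in `cv_results_`.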