scikit 学习决策树模型评估
scikit learn decision tree model evaluation
这里有相关的代码和文档,想知道默认的cross_val_score
没有明确指定score
,输出的数组是precision,AUC还是其他一些指标?
使用 Python 2.7 和 miniconda 解释器。
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
...
...
array([ 1. , 0.93..., 0.86..., 0.93..., 0.93...,
0.93..., 0.93..., 1. , 0.93..., 1. ])
此致,
林
如果未给出评分参数,cross_val_score
将默认使用您正在使用的估算器的 .score
方法。对于 DecisionTreeClassifier
,它是平均准确度(如下面的文档字符串所示):
In [11]: DecisionTreeClassifier.score?
Signature: DecisionTreeClassifier.score(self, X, y, sample_weight=None)
Docstring:
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.
Parameters
----------
X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns
-------
score : float
Mean accuracy of self.predict(X) wrt. y.
来自user guide:
By default, the score computed at each CV iteration is the score
method of the estimator. It is possible to change this by using the
scoring parameter:
来自 DecisionTreeClassifier documentation:
Returns the mean accuracy on the given test data and labels. In
multi-label classification, this is the subset accuracy which is a
harsh metric since you require for each sample that each label set be
correctly predicted.
不要被 "mean accuracy," 搞糊涂了,它只是计算准确度的常规方法。按照 source:
的链接
from .metrics import accuracy_score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
现在source metrics.accuracy_score
def accuracy_score(y_true, y_pred, normalize=True, sample_weight=None):
...
# Compute accuracy for each possible representation
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
if y_type.startswith('multilabel'):
differing_labels = count_nonzero(y_true - y_pred, axis=1)
score = differing_labels == 0
else:
score = y_true == y_pred
return _weighted_sum(score, sample_weight, normalize)
def _weighted_sum(sample_score, sample_weight, normalize=False):
if normalize:
return np.average(sample_score, weights=sample_weight)
elif sample_weight is not None:
return np.dot(sample_score, sample_weight)
else:
return sample_score.sum()
注意:对于accuracy_score
normalize参数默认为True
,因此它只是布尔numpy数组的returns np.average
,因此它只是正确的平均数预测。
这里有相关的代码和文档,想知道默认的cross_val_score
没有明确指定score
,输出的数组是precision,AUC还是其他一些指标?
使用 Python 2.7 和 miniconda 解释器。
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
...
...
array([ 1. , 0.93..., 0.86..., 0.93..., 0.93...,
0.93..., 0.93..., 1. , 0.93..., 1. ])
此致, 林
如果未给出评分参数,cross_val_score
将默认使用您正在使用的估算器的 .score
方法。对于 DecisionTreeClassifier
,它是平均准确度(如下面的文档字符串所示):
In [11]: DecisionTreeClassifier.score?
Signature: DecisionTreeClassifier.score(self, X, y, sample_weight=None)
Docstring:
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.
Parameters
----------
X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns
-------
score : float
Mean accuracy of self.predict(X) wrt. y.
来自user guide:
By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:
来自 DecisionTreeClassifier documentation:
Returns the mean accuracy on the given test data and labels. In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
不要被 "mean accuracy," 搞糊涂了,它只是计算准确度的常规方法。按照 source:
的链接 from .metrics import accuracy_score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
现在source metrics.accuracy_score
def accuracy_score(y_true, y_pred, normalize=True, sample_weight=None):
...
# Compute accuracy for each possible representation
y_type, y_true, y_pred = _check_targets(y_true, y_pred)
if y_type.startswith('multilabel'):
differing_labels = count_nonzero(y_true - y_pred, axis=1)
score = differing_labels == 0
else:
score = y_true == y_pred
return _weighted_sum(score, sample_weight, normalize)
def _weighted_sum(sample_score, sample_weight, normalize=False):
if normalize:
return np.average(sample_score, weights=sample_weight)
elif sample_weight is not None:
return np.dot(sample_score, sample_weight)
else:
return sample_score.sum()
注意:对于accuracy_score
normalize参数默认为True
,因此它只是布尔numpy数组的returns np.average
,因此它只是正确的平均数预测。