scikit learn decision tree model evaluation

Here are the relevant code and documentation. I'd like to know: when cross_val_score is called without an explicit scoring argument, is the output array precision, AUC, or some other metric?

I'm using Python 2.7 with the miniconda interpreter.

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
...                             
...
array([ 1.     ,  0.93...,  0.86...,  0.93...,  0.93...,
        0.93...,  0.93...,  1.     ,  0.93...,  1.      ])

Regards, Lin

If no scoring argument is given, cross_val_score defaults to the .score method of the estimator you are using. For DecisionTreeClassifier, that is mean accuracy (as the docstring below shows):

In [11]: DecisionTreeClassifier.score?
Signature: DecisionTreeClassifier.score(self, X, y, sample_weight=None)
Docstring:
Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.

Parameters
----------
X : array-like, shape = (n_samples, n_features)
    Test samples.

y : array-like, shape = (n_samples) or (n_samples, n_outputs)
    True labels for X.

sample_weight : array-like, shape = [n_samples], optional
    Sample weights.

Returns
-------
score : float
    Mean accuracy of self.predict(X) wrt. y.

From the user guide:

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:
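To see this concretely, here is a small sketch (not from the original thread) that reruns the question's example and compares the default scoring against an explicit scoring='accuracy'; the two arrays come out identical because the estimator's .score method is mean accuracy. Note that newer scikit-learn versions moved cross_val_score to sklearn.model_selection:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

# Default: falls back to clf.score, i.e. mean accuracy
default_scores = cross_val_score(clf, iris.data, iris.target, cv=10)

# Explicitly requesting accuracy
accuracy_scores = cross_val_score(clf, iris.data, iris.target, cv=10,
                                  scoring='accuracy')

# The two arrays are identical: the default really is accuracy.
print((default_scores == accuracy_scores).all())
```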

From the DecisionTreeClassifier documentation:

Returns the mean accuracy on the given test data and labels. In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Don't be confused by "mean accuracy" — it's just the regular way of computing accuracy. Follow the link to the source:
    from .metrics import accuracy_score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)

Now, the source of metrics.accuracy_score:

def accuracy_score(y_true, y_pred, normalize=True, sample_weight=None):
    ...
    # Compute accuracy for each possible representation
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    if y_type.startswith('multilabel'):
        differing_labels = count_nonzero(y_true - y_pred, axis=1)
        score = differing_labels == 0
    else:
        score = y_true == y_pred

    return _weighted_sum(score, sample_weight, normalize)
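As a quick illustration of that code path (toy label arrays, not from the thread above), accuracy_score with the default normalize=True returns the fraction of correct predictions, while normalize=False returns the raw count:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])  # one of four predictions is wrong

# normalize=True (the default): fraction of samples predicted correctly
print(accuracy_score(y_true, y_pred))                   # 0.75
# normalize=False: raw count of correctly predicted samples
print(accuracy_score(y_true, y_pred, normalize=False))  # 3
```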

If you still aren't convinced:

def _weighted_sum(sample_score, sample_weight, normalize=False):
    if normalize:
        return np.average(sample_score, weights=sample_weight)
    elif sample_weight is not None:
        return np.dot(sample_score, sample_weight)
    else:
        return sample_score.sum()

Note: the normalize argument of accuracy_score defaults to True, so this just returns np.average of a boolean numpy array, which is exactly the average fraction of correct predictions.
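A one-liner sketch of that last step: np.average of a boolean array treats True as 1 and False as 0, so it yields the fraction of True entries, i.e. the fraction of correct predictions.

```python
import numpy as np

# Per-sample "was the prediction correct?" flags, as built by accuracy_score
score = np.array([True, True, False, True])

# np.average over booleans gives the fraction of correct predictions
print(np.average(score))  # 0.75, i.e. 3 correct out of 4
```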