Which decision_function_shape for sklearn.svm.SVC when using OneVsRestClassifier?
I am doing multi-label classification and I am trying to predict the correct tags for questions
(X = questions, y = the list of tags for each question in X).
I am wondering which decision_function_shape for sklearn.svm.SVC should be used with OneVsRestClassifier?
From the documentation we can read that decision_function_shape can have two values, 'ovo' and 'ovr':
decision_function_shape : ‘ovo’, ‘ovr’ or None, default=None
Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes) as all other classifiers, or the original
one-vs-one (‘ovo’) decision function of libsvm which has shape
(n_samples, n_classes * (n_classes - 1) / 2). The default of None will
currently behave as ‘ovo’ for backward compatibility and raise a
deprecation warning, but will change ‘ovr’ in 0.19.
But I still do not understand what the difference is between these two:
# First decision_function_shape set to 'ovo'
estim = OneVsRestClassifier(SVC(kernel='linear', decision_function_shape='ovo'))
# Second decision_function_shape set to 'ovr'
estim = OneVsRestClassifier(SVC(kernel='linear', decision_function_shape='ovr'))
Which decision_function_shape should be used for a multi-label classification problem?
Edit: This Question asks a similar question, but there is no answer.
I think the question of which one should be used is best left to the situation; it could easily be part of your GridSearch. Intuitively, though, I would say that as far as the difference goes you will end up doing the same thing either way. Here is my reasoning:
OneVsRestClassifier is designed to model each class against all of the other classes independently, and to create a classifier for each situation. The way I understand this process is that OneVsRestClassifier grabs a class and creates a binary label for whether a point is or is not that class. This labelling then gets fed into whatever estimator you have chosen to use. I believe the confusion comes from the fact that SVC also allows you to make this same choice, but in practice, with this implementation, the choice does not matter, because you will always be feeding only two classes into the SVC.
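As a quick illustration of that binarisation step, here is a minimal sketch. I am using label_binarize only to mimic what I believe OneVsRestClassifier does internally; it is not its exact code path:
from sklearn.preprocessing import label_binarize

# Three classes, five samples.
labels = [0, 1, 2, 1, 0]

# Column k is the binary "is it class k or not?" target that would be
# handed to the k-th inner estimator.
print(label_binarize(labels, classes=[0, 1, 2]))
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]
#  [1 0 0]]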
Here is an example with the iris dataset:
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
data = load_iris()
X, y = data.data, data.target
estim1 = OneVsRestClassifier(SVC(kernel='linear', decision_function_shape='ovo'))
estim1.fit(X,y)
estim2 = OneVsRestClassifier(SVC(kernel='linear', decision_function_shape='ovr'))
estim2.fit(X,y)
print(estim1.coef_ == estim2.coef_)
array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)
So you can see that the coefficients are equal for all three estimators built by the two models. Granted, this dataset only has 150 samples and 3 classes, so these results could differ on a more complex dataset, but it is a simple proof of concept.
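You can also check the fitted objects directly. A small sketch continuing the example above (estimators_ and classes_ are standard attributes of the fitted objects):
print(len(estim1.estimators_))
# 3 -> one inner SVC per class
print([list(svc.classes_) for svc in estim1.estimators_])
# [[0, 1], [0, 1], [0, 1]] -> every inner SVC only ever sees a binary problem,
# which is why decision_function_shape has nothing to choose between here.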
The shape of the decision function is different, because ovo trains a classifier for each two-class combination, whereas ovr trains one classifier for each class fitted against all the other classes.
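If you want to see that difference in the number of fitted classifiers rather than in the shape of the decision function, here is a minimal sketch comparing OneVsRestClassifier and OneVsOneClassifier on digits, which has 10 classes, so the two counts actually differ:
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X_digits, y_digits = load_digits(return_X_y=True)

ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X_digits, y_digits)
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X_digits, y_digits)

print(len(ovr.estimators_))  # 10 -> one classifier per class
print(len(ovo.estimators_))  # 45 -> 10 * 9 / 2 pairwise classifiers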
The best example I could find is found here on http://scikit-learn.org:
SVC and NuSVC implement the “one-against-one” approach (Knerr et al., 1990) for multi-class classification. If n_class is the number of classes, then n_class * (n_class - 1) / 2 classifiers are constructed and each one trains data from two classes. To provide a consistent interface with other classifiers, the decision_function_shape option allows to aggregate the results of the “one-against-one” classifiers to a decision function of shape (n_samples, n_classes)
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC(decision_function_shape='ovo')
>>> clf.fit(X, Y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6
6
>>> clf.decision_function_shape = "ovr"
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes
4
What does this mean in simple terms?
To understand what n_class * (n_class - 1) / 2 means, generate the two-class combinations with itertools.combinations:
def ovo_classifiers(classes):
    import itertools
    n_class = len(classes)
    n = n_class * (n_class - 1) / 2
    combos = itertools.combinations(classes, 2)
    return (n, list(combos))
>>> ovo_classifiers(['a', 'b', 'c'])
(3.0, [('a', 'b'), ('a', 'c'), ('b', 'c')])
>>> ovo_classifiers(['a', 'b', 'c', 'd'])
(6.0, [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')])
Which estimator should be used for multi-label classification?
In your situation, you have a question with multiple labels (like here on Stack Overflow). If you know your tags (classes) ahead of time, I would probably suggest OneVsRestClassifier(LinearSVC()), but you could also try DecisionTreeClassifier or RandomForestClassifier (I think):
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC, LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

df = pd.DataFrame({
    'Tags': [['python', 'pandas'], ['c#', '.net'], ['ruby'],
             ['python'], ['c#'], ['sklearn', 'python']],
    'Questions': ['This is a post about python and pandas is great.',
                  'This is a c# post and i hate .net',
                  'What is ruby on rails?', 'who else loves python',
                  'where to learn c#', 'sklearn is a python package for machine learning']},
    columns=['Questions', 'Tags'])

X = df['Questions']

# Turn the list of tags per question into a binary indicator matrix.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['Tags'].values)

# One binary LinearSVC per tag, on top of a simple bag-of-words of the known tags.
pipeline = Pipeline([
    ('vect', CountVectorizer(token_pattern='|'.join(mlb.classes_))),
    ('linear_svc', OneVsRestClassifier(LinearSVC()))
])
pipeline.fit(X, y)

final = pd.DataFrame(pipeline.predict(X), index=X, columns=mlb.classes_)

def predict(text):
    return pd.DataFrame(pipeline.predict(text), index=text, columns=mlb.classes_)

test = ['is python better than c#', 'should i learn c#',
        'should i learn sklearn or tensorflow',
        'ruby or c# i am a dinosaur',
        'is .net still relevant']

print(predict(test))
Output:
                                      .net  c#  pandas  python  ruby  sklearn
is python better than c#                 0   1       0       1     0        0
should i learn c#                        0   1       0       0     0        0
should i learn sklearn or tensorflow     0   0       0       0     0        1
ruby or c# i am a dinosaur               0   1       0       0     1        0
is .net still relevant                   1   0       0       0     0        0
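If you want the tag names back rather than the 0/1 indicator matrix, the same MultiLabelBinarizer can map the predictions back to tags, for example:
for question, tags in zip(test, mlb.inverse_transform(pipeline.predict(test))):
    print(question, '->', tags)
# e.g. 'is python better than c#' -> ('c#', 'python')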