Why are the regions/decision boundaries overlapping with multi-class classification using SVM in sci-kit?
I am doing multi-class classification using SVM from the scikit-learn library. I am wondering why these regions (decision boundaries) overlap (as seen in the picture below)?
Results
Could someone explain the difference between one-vs-one and one-vs-rest with respect to overlapping regions? I assumed one-vs-one would clearly delineate the regions with no overlap, since it maximizes the margin between each pair of classes, and that one-vs-rest might have overlapping regions; but that may be inaccurate, because 3 of the 4 models I am training are one-vs-one and they show overlapping regions.

I also considered that this might be a plotting issue, but could not identify any problem. If alpha is 1, the regions no longer overlap, but I think that is expected, since each region is simply drawn on top of the ones beneath it (which is expected and does not solve the issue).
This is the function that creates, trains, and plots 4 different SVM models (3 different kernels using SVC and 1 using LinearSVC).
import warnings
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.metrics import confusion_matrix

def createSVMandPlot(X, y, x_name, y_name):
    h = .02  # step size in the mesh

    # we create an instance of SVM and fit our data. We do not scale our
    # data since we want to plot the support vectors
    C = 1.0  # SVM regularization parameter
    svc = svm.SVC(kernel='linear', C=C).fit(X, y)  # one-vs-one
    rbf_svc = svm.SVC(kernel='rbf', gamma='scale', C=C).fit(X, y)  # one-vs-one
    poly_svc = svm.SVC(kernel='poly', degree=3, gamma='scale', C=C).fit(X, y)  # one-vs-one
    lin_svc = svm.LinearSVC(C=C).fit(X, y)  # one-vs-rest

    print(str(x_name) + ' vs. ' + str(y_name))
    for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
        X_pred = clf.predict(X)
        # confusion_matrix expects (y_true, y_pred) in that order
        A = confusion_matrix(y, X_pred)
        print(A)
        c = int(np.sum(X_pred == y))
        print(str(c) + ' out of ' + str(len(y)) + ' predicted correctly (true positives)')

    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")

        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))

        # titles for the plots
        titles = ['SVC w/ linear kernel',
                  'LinearSVC (w/ linear kernel)',
                  'SVM w/ RBF kernel',
                  'SVM w/ poly(degree 3) kernel']

        plt.pause(7)  # pause so the printed output can be read
        for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
            # Assign a class to each point in the mesh [x_min, x_max]x[y_min, y_max]
            plt.subplot(2, 2, i + 1)
            plt.subplots_adjust(wspace=0.4, hspace=0.4)

            Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
            # Put the result into a color plot
            Z = Z.reshape(xx.shape)
            plt.contourf(xx, yy, Z, alpha=.5)

            # Plot also the training points
            plt.scatter(X[:, 0], X[:, 1], s=13, c=y)
            plt.xlabel(x_name)
            plt.ylabel(y_name)
            plt.xlim(xx.min(), xx.max())
            plt.ylim(yy.min(), yy.max())
            plt.xticks(())
            plt.yticks(())
            plt.title(titles[i])
    plt.show()
The result is an image in which the decision boundaries/regions overlap. This means that if a point sits at a particular 2D coordinate (x1, y1), it could be classified as two or more classes instead of just one, which is neither desired nor expected. Could someone explain what might be happening? Thanks.

EDIT: I attached a picture of the result with the overlapping decision boundaries.
A multi-class SVM cannot avoid overlapping regions. There is a fairly clear explanation in https://arxiv.org/ftp/arxiv/papers/0711/0711.2914.pdf :
As mentioned before, SVM classification is essentially a binary (two-class) classification technique,
which has to be modified to handle the multiclass tasks in real world situations e.g. derivation of
land cover information from satellite images. Two of the common methods to enable this adaptation
include the 1A1 and 1AA techniques. The 1AA approach represents the earliest and most common
SVM multiclass approach (Melgani and Bruzzone, 2004) and involves the division of an N class
dataset into N two-class cases. If say the classes of interest in a satellite image include water,
vegetation and built up areas, classification would be effected by classifying water against non-water
areas i.e. (vegetation and built up areas) or vegetation against non-vegetative areas i.e. (water and
built up areas). The 1A1 approach on the other hand involves constructing a machine for each pair of
classes resulting in N(N-1)/2 machines. When applied to a test point, each classification gives one
vote to the winning class and the point is labeled with the class having most votes. This approach
can be further modified to give weighting to the voting process. From machine learning theory, it
is acknowledged that the disadvantage the 1AA approach has over 1A1 is that its performance can
be compromised due to unbalanced training datasets (Gualtieri and Cromp, 1998), however, the 1A1
approach is more computationally intensive since the results of more SVM pairs ought to be
computed. In this paper, the performance of these two techniques are compared and evaluated to
establish their performance on the extraction of land cover information from satellite images.
So you have N classifiers, or N(N-1)/2 classifiers, each of which uses the whole available space. Since these are (for the purposes of this explanation) independent, the only way for their decision boundaries not to cross is for the boundaries to be parallel, and even then the regions would overlap (I feel this sentence may not be the clearest; do not hesitate to ask for more explanation if needed).
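A minimal sketch of this independence, on hypothetical toy data (`make_blobs` is used here only for illustration): in one-vs-rest, each binary classifier is trained separately, so its positive half-space can overlap with (or leave gaps between) the others'. `predict` then resolves the ambiguity by taking the class with the highest decision score.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_blobs

# Toy 3-class dataset (illustrative only)
X, y = make_blobs(n_samples=90, centers=3, random_state=0)
clf = LinearSVC(C=1.0).fit(X, y)  # one-vs-rest: 3 independent binary classifiers

# decision_function returns one score per class for every point
scores = clf.decision_function(X)          # shape (90, 3)
positive_votes = (scores > 0).sum(axis=1)  # how many binary classifiers claim each point

# A point may be claimed by 0, 1, or several of the binary classifiers;
# predict() picks the argmax of the scores, so the final labeling is unique
# even though the underlying binary regions can overlap.
print(np.unique(positive_votes))
print((clf.predict(X) == scores.argmax(axis=1)).all())
```

The same reasoning applies to one-vs-one, except the votes come from N(N-1)/2 pairwise classifiers instead of N one-vs-rest ones.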
If you want clear, non-overlapping regions, I suggest using another algorithm that handles multi-class problems natively, such as KNN.