Selecting data points in the neighbourhood of support vectors

Question

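I have been thinking about this but am not sure how to do it. I have an imbalanced binary dataset, and I would like to use an SVM to select the subset of majority-class data points that lie nearest to the support vectors. After that, I can fit a binary classifier on this "balanced" data.

To illustrate what I mean, here is an MWE: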

# packages import
from collections import Counter
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
import seaborn as sns 

# sample data
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.9], flip_y=0, random_state=1)

# class distribution summary
print(Counter(y))
# Output: Counter({0: 91, 1: 9})

# fit svm model 
svc_model = SVC(kernel='linear', random_state=32)
svc_model.fit(X, y)

plt.figure(figsize=(10, 8))

# Plotting our two-features-space
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, s=50)

# Constructing a hyperplane using a formula.
w = svc_model.coef_[0]           # w consists of 2 elements
b = svc_model.intercept_[0]      # b consists of 1 element
x_points = np.linspace(-1, 1)    # generating x-points from -1 to 1
y_points = -(w[0] / w[1]) * x_points - b / w[1]  # getting corresponding y-points

# Plotting a red hyperplane
plt.plot(x_points, y_points, c='r')

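The two classes are well separated by the hyperplane. We can see the support vectors for both classes (even more clearly for class 1).

Since the minority class 1 has 9 data points, I want to down-sample class 0 by selecting its support vectors and the other class 0 data points nearest to them. The class distribution then becomes {0: 9, 1: 9}, ignoring all other data points of class 0. I will then use this to fit a binary classifier such as LR (or even SVC).

My question is how to select those class 0 data points that are nearest to the class 0 support vectors, in a way that balances the data against the minority class 1.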

Answer 1

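This can be achieved as follows: take the support vector of class 0 (sv0), iterate over all data points of class 0 (X[y == 0]), compute the distance (d) from each to the point represented by the support vector, sort them, take the 9 with the smallest distances, and concatenate them with the points of class 1 to create the downsampled data (X_ds, y_ds).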

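# First support vector; support_vectors_ is grouped by class, so this one belongs to class 0
sv0 = svc_model.support_vectors_[0]

# Euclidean distance from every class-0 point to sv0
distances = []
for i, x in enumerate(X[y == 0]):
    d = np.linalg.norm(sv0 - x)
    distances.append((i, d))

# Keep the indices of the 9 closest class-0 points
distances.sort(key=lambda tup: tup[1])
index = [i for i, d in distances][:9]

# Combine them with all class-1 points to form the downsampled data
X_ds = np.concatenate((X[y == 0][index], X[y == 1]))
y_ds = np.concatenate((y[y == 0][index], y[y == 1]))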

# Redraw part of the hyperplane and all points, then highlight the selected class-0 points
plt.plot(x_points[19:-29], y_points[19:-29], c='r')
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, s=50)
plt.scatter(X_ds[y_ds == 0][:,0], X_ds[y_ds == 0][:,1], color='yellow', alpha=0.4)
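
The loop above measures distance to a single support vector only. If class 0 has several support vectors, one possible variant is to rank each class 0 point by its distance to the nearest class 0 support vector, which also keeps the support vectors themselves (their distance is 0). The sketch below is only an illustration under that assumption: `downsample_near_support_vectors` is a hypothetical helper name, not a scikit-learn function, and the final `LogisticRegression` fit simply stands in for the "LR" classifier mentioned in the question.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


def downsample_near_support_vectors(X, y, model, majority_label, n_keep):
    """Hypothetical helper: keep the n_keep majority-class points that are
    closest to any majority-class support vector, plus all other points."""
    # model.support_ holds the indices (into X) of the support vectors
    sv_idx = model.support_
    sv_majority = X[sv_idx[y[sv_idx] == majority_label]]

    maj_idx = np.flatnonzero(y == majority_label)
    # Pairwise Euclidean distances, shape (n_majority, n_majority_support_vectors)
    diffs = X[maj_idx][:, None, :] - sv_majority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Distance from each majority point to its nearest support vector
    nearest = dists.min(axis=1)

    keep = maj_idx[np.argsort(nearest)[:n_keep]]
    rest = np.flatnonzero(y != majority_label)
    selected = np.concatenate((keep, rest))
    return X[selected], y[selected]


# Same sample data and model as in the question
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.9], flip_y=0,
                           random_state=1)
svc_model = SVC(kernel='linear', random_state=32).fit(X, y)

# Match the size of the minority class (class 1)
n_minority = int(np.sum(y == 1))
X_ds, y_ds = downsample_near_support_vectors(
    X, y, svc_model, majority_label=0, n_keep=n_minority)
print(Counter(y_ds))

# Fit the second-stage classifier on the balanced subset
clf = LogisticRegression().fit(X_ds, y_ds)
```

Note that if class 0 has more support vectors than `n_keep`, this keeps only some of them, so the choice of `n_keep` is a trade-off between balance and keeping every support vector.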
