Ambiguity when using a custom kernel with an `sklearn.svm` regressor

Question

I want to use a custom kernel function in the Epsilon-Support Vector Regression module of sklearn.svm. I found this code in the scikit-learn documentation as an example of a custom kernel for SVC:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features. We could
                  # avoid this ugly slicing by using a two-dim dataset
Y = iris.target


def my_kernel(X, Y):
    """
    We create a custom kernel:

                 (2  0)
    k(X, Y) = X  (    ) Y.T
                 (0  1)
    """
    M = np.array([[2, 0], [0, 1.0]])
    return np.dot(np.dot(X, M), Y.T)


h = .02  # step size in the mesh

# we create an instance of SVM and fit our data.
clf = svm.SVC(kernel=my_kernel)
clf.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired, edgecolors='k')
plt.title('3-Class classification using Support Vector Machine with custom'
      ' kernel')
plt.axis('tight')
plt.show()

I want to define some function such as:

import random

def my_new_kernel(X):
    a, b, c = (random.randint(0, 100) for _ in range(3))
    # imagine f1, f2, f3 are functions like sin(x), cos(x), ...
    ans = a * f1(X) + b * f2(X) + c * f3(X)
    return ans

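My understanding of a kernel method is that it is a function that takes the feature matrix (X) as input and returns a matrix of shape (n, 1). The SVM then appends the returned matrix to the feature columns and uses it to classify the labels Y.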

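In the code above, the kernel is used inside the svm.fit function, and I can't figure out what the X and Y inputs of the kernel are, or what shapes they have. If X and Y (the inputs of the my_kernel method) are the features and labels of the dataset, then how does the kernel work for test data, where we have no labels?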

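Actually, I want to use an SVM on a dataset of shape (10000, 6) (5 columns = features, 1 column = label). If I want to use the my_new_kernel method, what would its inputs and output be, and what shapes would they have?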

Answer 1

Your exact question is rather unclear; here are some comments that may be helpful.

I can't figure out what are X and Y inputs of kernel and their shapes. if X and Y (inputs of my_kernel method) are the features and label of dataset,

For fit, yes: X is the feature matrix and y holds the labels. From the documentation of fit:

Parameters:

X : {array-like, sparse matrix}, shape (n_samples, n_features)

Training vectors, where n_samples is the number of samples and n_features is the number of features. For kernel="precomputed", the expected shape of X is (n_samples, n_samples).

y : array-like, shape (n_samples,)

Target values (class labels in classification, real numbers in regression)

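This is exactly the same as with the built-in kernels. The kernel function itself, by contrast, receives two sample matrices of shape (n_samples_1, n_features) and (n_samples_2, n_features) and must return a matrix of shape (n_samples_1, n_samples_2); the labels are never passed to it.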
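The shapes a callable kernel receives can be checked empirically. The sketch below is only an illustration under stated assumptions: the data are synthetic (5 feature columns, with 200 rows standing in for the 10000 in the question), and `logging_kernel` and `seen_shapes` are hypothetical names introduced here; the linear kernel is a placeholder, not the kernel from the question.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic stand-in for the (10000, 6) dataset: 5 features + 1 target.
# Only 200 rows are used so the sketch runs quickly.
X_train = rng.normal(size=(200, 5))
y_train = X_train @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)

seen_shapes = []  # shapes of the two arguments on each kernel call


def logging_kernel(A, B):
    """Hypothetical linear kernel that records the shapes it is given."""
    seen_shapes.append((A.shape, B.shape))
    return A @ B.T  # Gram matrix, shape (A.shape[0], B.shape[0])


reg = SVR(kernel=logging_kernel)
reg.fit(X_train, y_train)
print(seen_shapes)
```

During fit, both arguments are feature matrices built from the training samples; the targets reach the estimator only through the y argument of fit.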

so then how does the kernel work for test data where we have no labels?

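A closer look at the code you provided reveals that the labels Y are indeed used only during training (fit); they are of course not used during prediction (clf.predict() in the code above; not to be confused with yy, which has nothing to do with Y).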
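To make this concrete for the regression case, here is a minimal sketch of predicting on unlabeled rows with a callable kernel. It assumes synthetic data and a hypothetical `linear_kernel` helper, neither of which comes from the question. At prediction time the kernel is evaluated between the test samples and the training samples, so no labels are needed. The comparison with kernel="precomputed" is included only as a sanity check.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 5))
y_train = np.sin(X_train[:, 0]) + 0.5 * X_train[:, 1]
X_test = rng.normal(size=(40, 5))  # no labels exist for these rows


def linear_kernel(A, B):
    """Hypothetical kernel: rows of A compared against rows of B."""
    return A @ B.T  # shape (A.shape[0], B.shape[0])


# Callable kernel: scikit-learn builds the Gram matrices itself.
reg_callable = SVR(kernel=linear_kernel).fit(X_train, y_train)
pred_callable = reg_callable.predict(X_test)

# Same model with the Gram matrices computed by hand.
reg_pre = SVR(kernel="precomputed").fit(linear_kernel(X_train, X_train), y_train)
pred_pre = reg_pre.predict(linear_kernel(X_test, X_train))  # input shape (40, 150)

print(pred_callable.shape, np.allclose(pred_callable, pred_pre))
```

The precomputed variant shows what the test-time kernel matrix looks like: one row per test sample and one column per training sample.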

