Why does this clean data provide strange SVM classification results?

My problem and my question are in bold below.

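I have successfully used Accord.NET's support vector machines by following examples on documentation pages such as this one. However, when using a KernelSupportVectorMachine with a OneclassSupportVectorLearning to train it, the training process results in large error values and incorrect classifications.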

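The small example below illustrates what I mean. It generates a dense cluster of training points, then trains an SVM to classify points as inliers or outliers of the cluster. The training cluster is simply a 0.6 x 0.6 square centered at the origin, with training points spaced 0.1 apart: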

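using System;
using Accord.MachineLearning.VectorMachines;
using Accord.MachineLearning.VectorMachines.Learning;
using Accord.Statistics.Kernels;

static void Main(string[] args)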
{
    // Model and training parameters
    double kernelSigma = 0.1;
    double teacherNu = 0.5;
    double teacherTolerance = 0.01;


    // Generate input point cloud, a 0.6 x 0.6 square centered at 0,0.
    double[][] trainingInputs = new double[49][];
    int inputIdx = 0;
    for (double x = -0.3; x <= 0.31; x += 0.1) {
        for (double y = -0.3; y <= 0.31; y += 0.1) {
            trainingInputs[inputIdx] = new double[] { x, y };
            inputIdx++;
        }
    }


    // Generate inlier and outlier test points.
    double[][] outliers =
    {
        new double[] { 1E6, 1E6 },  // Very far outlier
        new double[] { 0, 1E6 },    // Very far outlier
        new double[] { 100, -100 }, // Far outlier
        new double[] { 0, -100 },   // Far outlier
        new double[] { -10, -10 },  // Still far outlier
        new double[] { 0, -10 },    // Still far outlier
    };
    double[][] inliers =
    {
        new double[] { 0, 0 },      // Middle of cluster
        new double[] { .15, .15 },  // Halfway to corner of cluster
        new double[] { -0.1, 0 },   // Comfortably inside cluster
        new double[] { 0.25, 0 }    // Near inside edge of cluster
    };


    // Construct the kernel, model, and trainer, then train.
    Console.WriteLine($"Training model with parameters:");
    Console.WriteLine($"  kernelSigma = {kernelSigma.ToString("#.##")}");
    Console.WriteLine($"  teacherNu={teacherNu.ToString("#.##")}");
    Console.WriteLine($"  teacherTolerance={teacherTolerance}");
    Console.WriteLine();

    var kernel = new Gaussian(kernelSigma);
    var svm = new KernelSupportVectorMachine(kernel, inputs: 1);
    var teacher = new OneclassSupportVectorLearning(svm, trainingInputs)
    {
        Nu = teacherNu,
        Tolerance = teacherTolerance
    };
    double error = teacher.Run();

    Console.WriteLine($"Training complete - error is {error.ToString("#.##")}");
    Console.WriteLine();


    // Test trained classifier.
    Console.WriteLine("Testing outliers:");
    foreach (double[] outlier in outliers) {
        WriteResultDetail(svm, outlier);
    }
    Console.WriteLine();
    Console.WriteLine("Testing inliers:");
    foreach (double[] inlier in inliers) {
        WriteResultDetail(svm, inlier);
    }
}

private static void WriteResultDetail(KernelSupportVectorMachine svm, double[] coordinate)
{
    string prettyCoord = $"{{ {string.Join(", ", coordinate)} }}".PadRight(20);
    Console.Write($"Classifying: {prettyCoord} Result: ");

    // Classify coordinate, print results.
    double result = svm.Compute(coordinate);
    if (Math.Sign(result) == 1) {
        Console.Write("Inlier");
    }
    else {
        Console.Write("Outlier");
    }
    Console.Write($" ({result.ToString("#.##")})\n");
}

Here is the output for a reasonable parameter set:

Training model with parameters:
  kernelSigma = .1
  teacherNu=.5
  teacherTolerance=0.01

Training complete - error is 222.4

Testing outliers:
Classifying: { 1000000, 1000000 } Result: Inlier (2.28)
Classifying: { 0, 1000000 }       Result: Inlier (2.28)
Classifying: { 100, -100 }        Result: Inlier (2.28)
Classifying: { 0, -100 }          Result: Inlier (2.28)
Classifying: { -10, -10 }         Result: Inlier (2.28)
Classifying: { 0, -10 }           Result: Inlier (2.28)

Testing inliers:
Classifying: { 0, 0 }             Result: Inlier (4.58)
Classifying: { 0.15, 0.15 }       Result: Inlier (4.51)
Classifying: { -0.1, 0 }          Result: Inlier (4.55)
Classifying: { 0.25, 0 }          Result: Inlier (4.64)

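The numbers in parentheses are the scores the SVM gave for each coordinate. With SVMs from Accord.NET (usually), a negative score is one class and a positive score is the other. Here, everything has a positive score. The inliers are classified correctly, but the outliers (even the very far outliers) are also classified as inliers.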

Note that every other time I have trained a model with Accord.NET, the training error has been very close to zero, but here it is over 200.

Here is the output for another parameter set:

Training model with parameters:
  kernelSigma = .3
  teacherNu=.8
  teacherTolerance=0.01

Training complete - error is 1945.67

Testing outliers:
Classifying: { 1000000, 1000000 } Result: Inlier (20.96)
Classifying: { 0, 1000000 }       Result: Inlier (20.96)
Classifying: { 100, -100 }        Result: Inlier (20.96)
Classifying: { 0, -100 }          Result: Inlier (20.96)
Classifying: { -10, -10 }         Result: Inlier (20.96)
Classifying: { 0, -10 }           Result: Inlier (20.96)

Testing inliers:
Classifying: { 0, 0 }             Result: Inlier (44.52)
Classifying: { 0.15, 0.15 }       Result: Inlier (41.62)
Classifying: { -0.1, 0 }          Result: Inlier (43.85)
Classifying: { 0.25, 0 }          Result: Inlier (40.53)

Again, a very high training error and all positive scores.

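The model is definitely getting something out of training - the scores differ between inliers and outliers. But why does this simple scenario not give the results it should, with scores of differing sign?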


PS. Here is a similar program that tests many combinations of training and model parameters, and here is its output. Again, everything results in positive classification scores, high error values, and incorrectly classified outliers.

The issue raised in the question has been addressed in version 3.7.0 of Accord.NET. A unit test with an example similar to yours has also been added in commit be81aab.
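
On the identical scores that every far point receives in the question's output (2.28 in the first run, 20.96 in the second): this is what one would expect if the score has the usual kernel-machine form, a weighted sum of kernel values plus a constant offset. The sketch below is only an illustration under that assumption and is not Accord.NET's code. The class `FarPointScoreSketch`, the kernel form exp(-||a - b||^2 / (2 * sigma^2)), and the support vectors, weights, and offset are all hypothetical. It shows that with a small sigma the Gaussian kernel underflows to zero for distant points, so their score collapses to the offset alone, and a positive offset then labels every far point an inlier.

```csharp
using System;

public static class FarPointScoreSketch
{
    // Gaussian (RBF) kernel, assumed form: exp(-||a - b||^2 / (2 * sigma^2)).
    public static double Kernel(double[] a, double[] b, double sigma)
    {
        double squaredDistance = 0.0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = a[i] - b[i];
            squaredDistance += d * d;
        }
        return Math.Exp(-squaredDistance / (2.0 * sigma * sigma));
    }

    // Generic kernel-machine score (assumed form):
    // offset + sum_i weights[i] * Kernel(supportVectors[i], x).
    public static double Score(double[][] supportVectors, double[] weights,
                               double offset, double sigma, double[] x)
    {
        double sum = offset;
        for (int i = 0; i < supportVectors.Length; i++)
        {
            sum += weights[i] * Kernel(supportVectors[i], x, sigma);
        }
        return sum;
    }

    public static void Main()
    {
        // Hypothetical values chosen for illustration only,
        // not taken from a trained Accord.NET model.
        double[][] supportVectors =
        {
            new double[] { -0.3, -0.3 },
            new double[] { 0.0, 0.0 },
            new double[] { 0.3, 0.3 }
        };
        double[] weights = { 1.0, 1.0, 1.0 };
        double offset = 2.0;
        double sigma = 0.1;

        // Far points: every kernel term underflows to 0, leaving only the offset.
        Console.WriteLine(Score(supportVectors, weights, offset, sigma, new double[] { 0, -10 }));
        Console.WriteLine(Score(supportVectors, weights, offset, sigma, new double[] { 1E6, 1E6 }));

        // A point inside the cluster: the kernel terms add to the offset.
        Console.WriteLine(Score(supportVectors, weights, offset, sigma, new double[] { 0, 0 }));
    }
}
```

Under these assumptions both far points evaluate to exactly the offset, while the point at the origin scores higher. That matches the pattern in the question, where inliers score above a constant positive baseline shared by all outliers. It does not by itself say why the offset came out positive; that part is what the fix in 3.7.0 concerns.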