SVM - scoring function

Question

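I have the following SVM classification output from Weka. I want to plot the SVM classifier output as anomaly or normal. How can I get the SVM scoring function from this output?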

=== Run information ===

Scheme:       weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -E 1.0 -C 250007"
Relation:     KDDTrain
Instances:    125973
Attributes:   42
              duration
              protocol_type
              service
              flag
              src_bytes
              dst_bytes
              land
              wrong_fragment
              urgent
              hot
              num_failed_logins
              logged_in
              num_compromised
              root_shell
              su_attempted
              num_root
              num_file_creations
              num_shells
              num_access_files
              num_outbound_cmds
              is_host_login
              is_guest_login
              count
              srv_count
              serror_rate
              srv_serror_rate
              rerror_rate
              srv_rerror_rate
              same_srv_rate
              diff_srv_rate
              srv_diff_host_rate
              dst_host_count
              dst_host_srv_count
              dst_host_same_srv_rate
              dst_host_diff_srv_rate
              dst_host_same_src_port_rate
              dst_host_srv_diff_host_rate
              dst_host_serror_rate
              dst_host_srv_serror_rate
              dst_host_rerror_rate
              dst_host_srv_rerror_rate
              class
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

SMO

Kernel used:
  Linear Kernel: K(x,y) = <x,y>

Classifier for classes: normal, anomaly

BinarySMO

Machine linear: showing attribute weights, not support vectors.

        -0.0498 * (normalized) duration
 +       0.5131 * (normalized) protocol_type=tcp
 +      -0.6236 * (normalized) protocol_type=udp
 +       0.1105 * (normalized) protocol_type=icmp
 +      -1.1861 * (normalized) service=auth
 +       0      * (normalized) service=bgp
 +       0      * (normalized) service=courier
 +       1      * (normalized) service=csnet_ns
 +       1      * (normalized) service=ctf
 +       1      * (normalized) service=daytime
 +      -0      * (normalized) service=discard
 +      -1.2505 * (normalized) service=domain
 +      -0.6878 * (normalized) service=domain_u
 +       0.9418 * (normalized) service=echo
 +       1.1964 * (normalized) service=eco_i
 +       0.9767 * (normalized) service=ecr_i
 +       0.0073 * (normalized) service=efs
 +       0.0595 * (normalized) service=exec
 +      -1.4426 * (normalized) service=finger
 +      -1.047  * (normalized) service=ftp
 +      -1.4225 * (normalized) service=ftp_data
 +       2      * (normalized) service=gopher
 +       1      * (normalized) service=hostnames
 +      -0.9961 * (normalized) service=http
 +       0.7255 * (normalized) service=http_443
 +       0.5128 * (normalized) service=imap4
 +      -6.3664 * (normalized) service=IRC
 +       1      * (normalized) service=iso_tsap
 +      -0      * (normalized) service=klogin
 +       0      * (normalized) service=kshell
 +       0.7422 * (normalized) service=ldap
 +       1      * (normalized) service=link
 +       0.5993 * (normalized) service=login
 +       1      * (normalized) service=mtp
 +       1      * (normalized) service=name
 +       0.2322 * (normalized) service=netbios_dgm
 +       0.213  * (normalized) service=netbios_ns
 +       0.1902 * (normalized) service=netbios_ssn
 +       1.1472 * (normalized) service=netstat
 +       0.0504 * (normalized) service=nnsp
 +       1.058  * (normalized) service=nntp
 +      -1      * (normalized) service=ntp_u
 +      -1.5344 * (normalized) service=other
 +       1.3595 * (normalized) service=pm_dump
 +       0.8355 * (normalized) service=pop_2
 +      -2      * (normalized) service=pop_3
 +       0      * (normalized) service=printer
 +       1.051  * (normalized) service=private
 +      -0.3082 * (normalized) service=red_i
 +       1.0034 * (normalized) service=remote_job
 +       1.0112 * (normalized) service=rje
 +      -1.0454 * (normalized) service=shell
 +      -1.6948 * (normalized) service=smtp
 +       0.1388 * (normalized) service=sql_net
 +      -0.3438 * (normalized) service=ssh
 +       1      * (normalized) service=supdup
 +       0.8756 * (normalized) service=systat
 +      -1.6856 * (normalized) service=telnet
 +      -0      * (normalized) service=tim_i
 +      -0.8579 * (normalized) service=time
 +      -0.726  * (normalized) service=urh_i
 +      -1.0285 * (normalized) service=urp_i
 +       1.0347 * (normalized) service=uucp
 +       0      * (normalized) service=uucp_path
 +       0      * (normalized) service=vmnet
 +       1      * (normalized) service=whois
 +      -1.3388 * (normalized) service=X11
 +       0      * (normalized) service=Z39_50
 +       1.7882 * (normalized) flag=OTH
 +      -3.0982 * (normalized) flag=REJ
 +      -1.7279 * (normalized) flag=RSTO
 +       1      * (normalized) flag=RSTOS0
 +       2.4264 * (normalized) flag=RSTR
 +       1.5906 * (normalized) flag=S0
 +      -1.952  * (normalized) flag=S1
 +      -0.9628 * (normalized) flag=S2
 +      -0.3455 * (normalized) flag=S3
 +       1.2757 * (normalized) flag=SF
 +       0.0054 * (normalized) flag=SH
 +       0.8742 * (normalized) src_bytes
 +       0.0542 * (normalized) dst_bytes
 +      -1.2659 * (normalized) land=1
 +       2.7922 * (normalized) wrong_fragment
 +       0.0662 * (normalized) urgent
 +       8.1153 * (normalized) hot
 +       2.4822 * (normalized) num_failed_logins
 +       0.2242 * (normalized) logged_in=1
 +      -0.0544 * (normalized) num_compromised
 +       0.9248 * (normalized) root_shell
 +      -2.363  * (normalized) su_attempted
 +      -0.2024 * (normalized) num_root
 +      -1.2791 * (normalized) num_file_creations
 +      -0.0314 * (normalized) num_shells
 +      -1.4125 * (normalized) num_access_files
 +      -0.0154 * (normalized) is_host_login=1
 +      -2.3307 * (normalized) is_guest_login=1
 +       4.3191 * (normalized) count
 +      -2.7484 * (normalized) srv_count
 +      -0.6276 * (normalized) serror_rate
 +       2.843  * (normalized) srv_serror_rate
 +       0.6105 * (normalized) rerror_rate
 +       3.1388 * (normalized) srv_rerror_rate
 +      -0.1262 * (normalized) same_srv_rate
 +      -0.1825 * (normalized) diff_srv_rate
 +       0.2961 * (normalized) srv_diff_host_rate
 +       0.7812 * (normalized) dst_host_count
 +      -1.0053 * (normalized) dst_host_srv_count
 +       0.0284 * (normalized) dst_host_same_srv_rate
 +       0.4419 * (normalized) dst_host_diff_srv_rate
 +       1.384  * (normalized) dst_host_same_src_port_rate
 +       0.8004 * (normalized) dst_host_srv_diff_host_rate
 +       0.2301 * (normalized) dst_host_serror_rate
 +       0.6401 * (normalized) dst_host_srv_serror_rate
 +       0.6422 * (normalized) dst_host_rerror_rate
 +       0.3692 * (normalized) dst_host_srv_rerror_rate
 -       2.5266

Number of kernel evaluations: -1049600465

Output predictions - sample output

inst#     actual  predicted error prediction
        1   1:normal   1:normal       1
        2   1:normal   1:normal       1
        3  2:anomaly  2:anomaly       1
        4   1:normal   1:normal       1
        5   1:normal   1:normal       1
        6  2:anomaly  2:anomaly       1
        7  2:anomaly  2:anomaly       1
        8  2:anomaly  2:anomaly       1
        9  2:anomaly  2:anomaly       1
       10  2:anomaly  2:anomaly       1
       11  2:anomaly  2:anomaly       1
       12  2:anomaly  2:anomaly       1
       13   1:normal   1:normal       1
       14  2:anomaly   1:normal   +   1
       15  2:anomaly  2:anomaly       1
       16  2:anomaly  2:anomaly       1
       17   1:normal   1:normal       1
       18  2:anomaly  2:anomaly       1
       19   1:normal   1:normal       1
       20   1:normal   1:normal       1
       21  2:anomaly  2:anomaly       1
       22  2:anomaly  2:anomaly       1
       23   1:normal   1:normal       1
       24   1:normal   1:normal       1
       25  2:anomaly  2:anomaly       1
       26   1:normal   1:normal       1
       27  2:anomaly  2:anomaly       1
       28   1:normal   1:normal       1
       29   1:normal   1:normal       1
       30   1:normal   1:normal       1
       31  2:anomaly  2:anomaly       1
       32  2:anomaly  2:anomaly       1
       33   1:normal   1:normal       1
       34  2:anomaly  2:anomaly       1
       35   1:normal   1:normal       1
       36   1:normal   1:normal       1
       37   1:normal   1:normal       1
       38  2:anomaly  2:anomaly       1
       39   1:normal   1:normal       1
       40  2:anomaly  2:anomaly       1
       41  2:anomaly  2:anomaly       1
       42  2:anomaly  2:anomaly       1
       43   1:normal   1:normal       1
       44   1:normal   1:normal       1
       45   1:normal   1:normal       1
       46  2:anomaly  2:anomaly       1
       47  2:anomaly  2:anomaly       1
       48   1:normal   1:normal       1
       49  2:anomaly   1:normal   +   1
       50  2:anomaly  2:anomaly       1

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.986    0.039    0.967      0.986    0.976      0.948    0.973     0.960     normal
                 0.961    0.014    0.983      0.961    0.972      0.948    0.973     0.963     anomaly
Weighted Avg.    0.974    0.028    0.974      0.974    0.974      0.948    0.973     0.962

=== Confusion Matrix ===

     a     b   <-- classified as
 66389   954 |     a = normal
  2301 56329 |     b = anomaly

Answer 1

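The output is the scoring function. Read each equals sign as a simple Boolean operator that evaluates to 1 for true and 0 for false. So, among all the possible values of a categorical attribute, only one coefficient contributes to the score.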
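Written as a formula, this reading of the listing looks roughly like the following. The symbols ($w_j$, $\tilde{x}_j$, $w_{a,v}$, $b$) are my own notation, not anything Weka prints; $\tilde{x}_j$ stands for the normalized value of a numeric attribute, and $\mathbf{1}[\cdot]$ is the 0/1 test described above.

```latex
f(x) \;=\; \sum_{j \,\in\, \text{numeric}} w_j \,\tilde{x}_j
      \;+\; \sum_{a \,\in\, \text{categorical}} \; \sum_{v} w_{a,v}\, \mathbf{1}[x_a = v]
      \;+\; b,
\qquad b = -2.5266
```

For each categorical attribute $a$, exactly one indicator equals 1, so the inner sum collapses to a single weight.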

For example, consider just the first three attributes, with these normalized inputs and the resulting values:

duration      2.0     -0.0498 * 2.0 => -0.0996
protocol_type icmp     0.1105
service       eco_i    1.1964

Note that for the other protocol_type and service terms, such as

-0.6236 * protocol_type=udp

the comparison evaluates to 0 (protocol_type=udp becomes 0), so those coefficients do not affect the sum.

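From these three attributes, the score so far is the sum of the three terms, 1.2073. Continue through the remaining attributes, then add the constant at the end of the listing, -2.5266; the total is the score for your vector.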
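The three-attribute sum above can be reproduced with a few lines of Python. This is only a sketch: the weights are copied from the listing in the question, but the dictionary layout and the `partial_score` name are my own, and only a handful of terms are included.

```python
# Weights copied from the SMO listing (a small subset, for illustration only).
NUMERIC_WEIGHTS = {"duration": -0.0498}
INDICATOR_WEIGHTS = {
    ("protocol_type", "tcp"): 0.5131,
    ("protocol_type", "udp"): -0.6236,
    ("protocol_type", "icmp"): 0.1105,
    ("service", "eco_i"): 1.1964,
    ("service", "http"): -0.9961,
}


def partial_score(instance):
    """Sum weight * value for numeric terms and weight * [attr == value] for indicators."""
    total = 0.0
    for name, weight in NUMERIC_WEIGHTS.items():
        total += weight * instance[name]
    for (name, value), weight in INDICATOR_WEIGHTS.items():
        # The "=" in the listing is a 0/1 test, so only the matching term survives.
        total += weight * (1.0 if instance[name] == value else 0.0)
    return total


# Already-normalized inputs from the worked example.
example = {"duration": 2.0, "protocol_type": "icmp", "service": "eco_i"}
print(round(partial_score(example), 4))  # 1.2073
```

A full implementation would list every term from the output and add the constant -2.5266 at the end.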

Does that explain it well enough?

The key passage in the blog you cited is:

if the output of the scoring function is negative then the input is classified as belonging to class y = -1. If the score is positive, the input is classified as belonging to class y = 1.

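Yes, it's that simple: implement this nice linear scoring function (42 attributes, 116 terms counting the constant), plug in a vector, and look at the sign of the result. One caution: the blog's y = 1 and y = -1 have to be mapped onto your two classes. As far as I know, Weka's SMO assigns a positive output to the second listed class (anomaly here, per "Classifier for classes: normal, anomaly") and a negative output to the first (normal), so confirm the direction against a few rows of your prediction output before plotting.

Yes, your model is much longer than the blog's example. That example is based on two continuous features; you have 42 attributes, three of which (protocol_type, service, flag) are multi-valued categoricals that expand into 78 indicator terms between them. The example has 3 support vectors; yours will have far more, but for a linear kernel Weka folds them into the attribute weights shown ("showing attribute weights, not support vectors"), so you never need them individually. Even so, this model follows the same principle: the sign of the score decides the class.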
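The sign rule from the blog quote can be sketched as below. The function name `decision` is hypothetical, and the result is deliberately left as y = 1 or y = -1: which of your two classes each sign corresponds to is something to verify against Weka's prediction output rather than take from this snippet.

```python
BIAS = -2.5266  # constant term from the last line of the weight listing


def decision(weighted_sum, bias=BIAS):
    """Return (score, y) with y = 1 for a positive score and y = -1 otherwise.

    A score of exactly 0 is treated as -1 here, which is an arbitrary choice.
    """
    score = weighted_sum + bias
    y = 1 if score > 0 else -1
    return score, y


# 1.2073 is only the three-term partial sum from the worked example; a real
# instance would use the sum over every term in the listing.
score, y = decision(1.2073)
print(round(score, 4), y)  # -1.3193 -1
```

Plotting the score itself (one number per instance) is also the simplest faithful picture of the classifier: everything on one side of zero gets one class, everything on the other side gets the other.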

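As for mapping this to a two-dimensional display... it's possible... but I don't know what you would find meaningful in this case. Mapping 42 variables down to 3 creates a lot of congestion in our space. I've seen some nice tricks here and there, especially with gradient fields, where the force vectors share the spatial interpretation of the data points. Weather maps manage to show the x, y, z coordinates of a measurement and add wind velocity (3D), cloud cover, and perhaps several other metrics to the display. That could be 10 symbolic dimensions.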

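In your case, we could perhaps treat the dimensions whose coefficients are below 0.07 in magnitude as insignificant; that saves 6 features. We could perhaps represent the three categorical features with color, dashed/dotted/solid symbols, and a tiny text overlay on the O or X (normal/anomaly data). That is 9 down without using Cartesian position (x, y, z coordinates, assuming the plot makes sense in 3D).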
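Here is a rough sketch of the "drop the tiny coefficients" idea. The weights are copied from the listing, but only a subset is shown, and `split_by_magnitude` is a made-up helper name. Keep in mind the weights apply to normalized attributes, so magnitude is only a rough proxy for relevance.

```python
# A subset of (attribute, weight) pairs copied from the listing; extend with the rest.
WEIGHTS = {
    "duration": -0.0498,
    "src_bytes": 0.8742,
    "dst_bytes": 0.0542,
    "urgent": 0.0662,
    "hot": 8.1153,
    "num_compromised": -0.0544,
    "num_shells": -0.0314,
    "count": 4.3191,
    "dst_host_same_srv_rate": 0.0284,
}

THRESHOLD = 0.07  # cutoff suggested in the answer


def split_by_magnitude(weights, threshold=THRESHOLD):
    """Split attribute names into (small, large) by the absolute value of their weight."""
    small = sorted(name for name, w in weights.items() if abs(w) < threshold)
    large = sorted(
        (name for name, w in weights.items() if abs(w) >= threshold),
        key=lambda name: abs(weights[name]),
        reverse=True,
    )
    return small, large


small, large = split_by_magnitude(WEIGHTS)
print(small)  # candidates to ignore when plotting
print(large)  # strongest first
```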

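However, I don't know your data well enough to suggest how to cram the remaining 33 features into 2 or 3 dimensions. Can you combine any of these inputs in some way? Would a linear combination of several features still give you something meaningful for the prediction?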

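If not, then we're stuck with the canonical approach: pick interesting combinations of features (usually pairs) and plot a graph for each one, ignoring the other features entirely. If none of those makes visual sense... then our answer is: no, we can't plot the data nicely. Sorry, but reality often does this to us in complex settings, and we end up handling the data with tables, correlations, and other methods that our 3D minds can cope with.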
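One possible way to shortlist pairs, sketched below, is to rank numeric attributes by the magnitude of their weight and enumerate pairs among the top few. Treating "large weight" as "interesting" is my assumption, not something the model guarantees, and `candidate_pairs` is a hypothetical helper.

```python
from itertools import combinations

# Illustrative subset of numeric attributes with their weights from the listing.
WEIGHTS = {
    "hot": 8.1153,
    "count": 4.3191,
    "srv_rerror_rate": 3.1388,
    "srv_serror_rate": 2.843,
    "wrong_fragment": 2.7922,
}


def candidate_pairs(weights, top_k=4):
    """Return every pair among the top_k attributes ranked by |weight|."""
    ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)[:top_k]
    return list(combinations(ranked, 2))


for x_attr, y_attr in candidate_pairs(WEIGHTS):
    # Each pair would become one scatter plot, with O/X markers for normal/anomaly.
    print(x_attr, "vs", y_attr)
```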

Answer 2

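This is a somewhat different approach, but I think it solves your underlying problem. I assume you used the Weka Explorer to generate the model. If you go to the Classify tab, click More options... and tick Output predictions, you get the probability for each classification. That should let you plot normal versus anomaly.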

For iris, I get something like

inst#,    actual, predicted, error, probability distribution
     1 3:Iris-vir 3:Iris-vir          0      0.333 *0.667
     2 3:Iris-vir 3:Iris-vir          0      0.333 *0.667
     3 3:Iris-vir 3:Iris-vir          0      0.333 *0.667
     4 3:Iris-vir 3:Iris-vir          0      0.333 *0.667
     5 3:Iris-vir 3:Iris-vir          0      0.333 *0.667
     6 1:Iris-set 1:Iris-set         *0.667  0.333  0    
     7 1:Iris-set 1:Iris-set         *0.667  0.333  0    
     8 1:Iris-set 1:Iris-set         *0.667  0.333  0    
     9 1:Iris-set 1:Iris-set         *0.667  0.333  0    
    10 1:Iris-set 1:Iris-set         *0.667  0.333  0

It includes the probability for each class.
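To get from that text to something you can plot, a small parser along these lines may be enough. It is a sketch that assumes the whitespace-separated plain-text layout shown above, class labels without spaces, and a "+" in the error column for misclassified rows; `parse_prediction_line` is my own name for it.

```python
def parse_prediction_line(line):
    """Parse one row of Weka's plain-text prediction output into a dict."""
    tokens = line.split()
    inst = int(tokens[0])
    actual = tokens[1].split(":", 1)[1]      # "3:Iris-vir" -> "Iris-vir"
    predicted = tokens[2].split(":", 1)[1]
    rest = tokens[3:]
    # A "+" in the error column marks a misclassified instance.
    error = bool(rest) and rest[0] == "+"
    if error:
        rest = rest[1:]
    # "*" flags the predicted class's value; strip it before converting.
    values = [float(tok.lstrip("*")) for tok in rest]
    return {"inst": inst, "actual": actual, "predicted": predicted,
            "error": error, "values": values}


row = parse_prediction_line("     1 3:Iris-vir 3:Iris-vir          0      0.333 *0.667")
print(row["predicted"], row["values"])
```

The same function also reads rows from the question's output, for example `"       14  2:anomaly   1:normal   +   1"`, where the only trailing number is the prediction column.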
