How can I do cross-validation for an AttributeSelectedClassifier model?

I have built a model like this:

base = Classifier(classname="weka.classifiers.trees.ADTree", 
                  options=["-B", "10", "-E", "-3", "-S", "1"])

CostS_cls = SingleClassifierEnhancer(classname="weka.classifiers.meta.CostSensitiveClassifier", 
                                options =["-cost-matrix", "[0.0 1.0; 1.0 0.0]", "-S", "1"])
CostS_cls.classifier = base
smote = Filter(classname="weka.filters.supervised.instance.SMOTE", 
               options=["-C", "0", "-K", "3", "-P", "250.0", "-S", "1"])
fc = FilteredClassifier(options=["-S","1"])
fc.filter = smote
fc.classifier = CostS_cls
bagging_cls = SingleClassifierEnhancer(classname="weka.classifiers.meta.Bagging",
                         options=["-P", "100", "-S", "1", "-num-slots", "1", "-I", "100"])
bagging_cls.classifier = fc
multisearch_cls = MultiSearch(options = ["-S", "1"])
multisearch_cls.evaluation = "FM"
multisearch_cls.search = ["-sample-size", "100", "-initial-folds", "2", "-subsequent-folds", "10",
                          "-initial-test-set", ".", "-subsequent-test-set", ".", "-num-slots", "1"]                        
mparam = MathParameter()
mparam.prop = "numOfBoostingIterations"
mparam.minimum = 5.0
mparam.maximum = 50.0
mparam.step = 1.0
mparam.base = 10.0
mparam.expression = "I"
multisearch_cls.parameters = [mparam]
multisearch_cls.classifier = bagging_cls
AttS_cls = AttributeSelectedClassifier()
AttS_cls.search = from_commandline('weka.attributeSelection.GreedyStepwise -B -T -1.7976931348623157E308 -N -1 -num-slots 1', classname=get_classname(ASSearch))
AttS_cls.evaluation = from_commandline('weka.attributeSelection.CfsSubsetEval -P 1 -E 1', classname=get_classname(ASEvaluation))
AttS_cls.classifier = multisearch_cls
train, test = data_modelos_1_2.train_test_split(70.0, Random(1))
AttS_cls.build_classifier(train)

I am trying to validate it via cross-validation, but when I do this:

train, test = data_modelos_1_2.train_test_split(70.0, Random(1))
AttS_cls.build_classifier(train)
evl = Evaluation(test)
evl.crossvalidate_model(AttS_cls, test, 10, Random(1))

I get this error:

---------------------------------------------------------------------------
JavaException                             Traceback (most recent call last)
/tmp/ipykernel_50548/1197040560.py in <module>
     47 print(AttS_cls.to_commandline())
     48 evl = Evaluation(test)
---> 49 evl.crossvalidate_model(AttS_cls, test, 10, Random(1))
     50 print(AttS_cls)
     51 print("----------------------------------------------------------------------------")

/usr/local/lib/python3.8/dist-packages/weka/classifiers.py in crossvalidate_model(self, classifier, data, num_folds, rnd, output)
   1289         else:
   1290             generator = [output.jobject]
-> 1291         javabridge.call(
   1292             self.jobject, "crossValidateModel",
   1293             "(Lweka/classifiers/Classifier;Lweka/core/Instances;ILjava/util/Random;[Ljava/lang/Object;)V",

~/.local/lib/python3.8/site-packages/javabridge/jutil.py in call(o, method_name, sig, *args)
    890     ret_sig = sig[sig.find(')')+1:]
    891     nice_args = get_nice_args(args, args_sig)
--> 892     result = fn(*nice_args)
    893     x = env.exception_occurred()
    894     if x is not None:

~/.local/lib/python3.8/site-packages/javabridge/jutil.py in fn(*args)
    857             x = env.exception_occurred()
    858             if x is not None:
--> 859                 raise JavaException(x)
    860             return result
    861     else:

JavaException: Thread-based execution of evaluation tasks failed!

So I don't know what I am doing wrong; I know that this type of model can be cross-validated with Weka, but I am trying to do it with pyweka and ran into this problem.

I turned your code snippet into one with imports and fixed the MultiSearch setup for Bagging (mparam.prop = "numIterations" instead of mparam.prop = "numOfBoostingIterations") to get it to execute.
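
For reference, here is that change in isolation (a minimal excerpt, assuming the JVM has been started with packages=True; the full listing follows further down):

from weka.core.classes import MathParameter

# Bagging exposes its number of iterations (-I) as the bean property "numIterations";
# "numOfBoostingIterations" belongs to the ADTree base learner and is not a property
# of the Bagging object that MultiSearch is asked to tune here.
mparam = MathParameter()
mparam.prop = "numIterations"
mparam.minimum = 5.0
mparam.maximum = 50.0
mparam.step = 1.0
mparam.base = 10.0
mparam.expression = "I"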

Since I don't have access to your data, I simply used the UCI dataset vote.arff.

Your code is a bit odd, as it performs a 70/30 train/test split, trains the classifier and then performs cross-validation on the test data. For cross-validation, you do not train the classifier yourself, as that happens inside the cross-validation loop (each classifier trained within that loop gets discarded, because cross-validation is only used for collecting statistics).
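
As a minimal sketch of that pattern (data and cls are placeholders, not your actual objects): the classifier is handed to crossvalidate_model untrained, and the statistics are read off the Evaluation object afterwards.

from weka.classifiers import Classifier, Evaluation
from weka.core.classes import Random

# assumes jvm.start() has been called and `data` is an Instances object with its class attribute set
cls = Classifier(classname="weka.classifiers.trees.J48")   # placeholder classifier for illustration
evl = Evaluation(data)                                      # Evaluation is initialized with the full dataset
evl.crossvalidate_model(cls, data, 10, Random(1))           # trains and discards one copy per fold internally
print(evl.summary())
print(evl.class_details())   # per-class precision/recall/F-measure
print(evl.matrix())          # confusion matrix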

The modified code further down therefore consists of three parts:

  1. your original evaluation code, commented out
  2. cross-validation performed correctly
  3. a train/test split evaluation

I don't use Jupyter notebooks; I tested the code successfully in a regular virtual environment on my Linux Mint machine:

  • Python: 3.8.10
  • Output of pip freeze:
    numpy==1.22.3
    packaging==21.3
    pyparsing==3.0.7
    python-javabridge==4.0.3
    python-weka-wrapper3==0.2.7
    

The modified code itself:

import weka.core.jvm as jvm
from weka.core.converters import load_any_file
from weka.classifiers import Classifier, SingleClassifierEnhancer, FilteredClassifier, MultiSearch, AttributeSelectedClassifier, Evaluation
from weka.core.classes import MathParameter, from_commandline, Random, get_classname
from weka.filters import Filter
from weka.attribute_selection import ASEvaluation, ASSearch

# packages=True is required because ADTree, SMOTE and MultiSearch come from Weka packages rather than core Weka
jvm.start(packages=True)

# the dataset/path needs adjusting
data_modelos_1_2 = load_any_file("/some/where/vote.arff")
data_modelos_1_2.class_is_last()

base = Classifier(classname="weka.classifiers.trees.ADTree",
                  options=["-B", "10", "-E", "-3", "-S", "1"])

CostS_cls = SingleClassifierEnhancer(classname="weka.classifiers.meta.CostSensitiveClassifier",
                                     options=["-cost-matrix", "[0.0 1.0; 1.0 0.0]", "-S", "1"])
CostS_cls.classifier = base
smote = Filter(classname="weka.filters.supervised.instance.SMOTE",
               options=["-C", "0", "-K", "3", "-P", "250.0", "-S", "1"])
fc = FilteredClassifier(options=["-S", "1"])
fc.filter = smote
fc.classifier = CostS_cls
bagging_cls = SingleClassifierEnhancer(classname="weka.classifiers.meta.Bagging",
                                       options=["-P", "100", "-S", "1", "-num-slots", "1", "-I", "100"])
bagging_cls.classifier = fc
multisearch_cls = MultiSearch(options=["-S", "1"])
multisearch_cls.evaluation = "FM"
multisearch_cls.search = ["-sample-size", "100", "-initial-folds", "2", "-subsequent-folds", "10",
                          "-initial-test-set", ".", "-subsequent-test-set", ".", "-num-slots", "1"]
mparam = MathParameter()
mparam.prop = "numIterations"
mparam.minimum = 5.0
mparam.maximum = 50.0
mparam.step = 1.0
mparam.base = 10.0
mparam.expression = "I"
multisearch_cls.parameters = [mparam]
multisearch_cls.classifier = bagging_cls

AttS_cls = AttributeSelectedClassifier()
AttS_cls.search = from_commandline('weka.attributeSelection.GreedyStepwise -B -T -1.7976931348623157E308 -N -1 -num-slots 1', classname=get_classname(ASSearch))
AttS_cls.evaluation = from_commandline('weka.attributeSelection.CfsSubsetEval -P 1 -E 1', classname=get_classname(ASEvaluation))
AttS_cls.classifier = multisearch_cls

# original
# train, test = data_modelos_1_2.train_test_split(70.0, Random(1))
# AttS_cls.build_classifier(train)
# evl = Evaluation(test)
# evl.crossvalidate_model(AttS_cls, test, 10, Random(1))
# print(evl.summary())

# cross-validation
print("\ncross-validation\n")
evl = Evaluation(data_modelos_1_2)
evl.crossvalidate_model(AttS_cls, data_modelos_1_2, 10, Random(1))
print(evl.summary())

# train/test split
print("\ntrain/test split\n")
train, test = data_modelos_1_2.train_test_split(70.0, Random(1))
AttS_cls.build_classifier(train)
evl = Evaluation(test)
evl.test_model(AttS_cls, test)
print(evl.summary())

jvm.stop()

This generated the following output:

cross-validation


Correctly Classified Instances         416               95.6322 %
Incorrectly Classified Instances        19                4.3678 %
Kappa statistic                          0.9094
Mean absolute error                      0.0737
Root mean squared error                  0.1778
Relative absolute error                 15.5353 %
Root relative squared error             36.5084 %
Total Number of Instances              435     


train/test split


Correctly Classified Instances         126               96.1832 %
Incorrectly Classified Instances         5                3.8168 %
Kappa statistic                          0.9216
Mean absolute error                      0.0735
Root mean squared error                  0.1649
Relative absolute error                 15.3354 %
Root relative squared error             33.6949 %
Total Number of Instances              131
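
If you also want to look at the individual predictions made during cross-validation (not just the aggregated statistics), crossvalidate_model accepts an optional prediction output object. A sketch of that, placed before the jvm.stop() call in the code above:

from weka.classifiers import PredictionOutput

# collect the per-instance predictions made across the folds
pout = PredictionOutput(classname="weka.classifiers.evaluation.output.prediction.PlainText")
evl = Evaluation(data_modelos_1_2)
evl.crossvalidate_model(AttS_cls, data_modelos_1_2, 10, Random(1), output=pout)
print(evl.summary())
print(pout.buffer_content())   # one line per test instance: actual vs. predicted label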