How can I cross-validate an AttributeSelectedClassifier model?
I built a model like this:
base = Classifier(classname="weka.classifiers.trees.ADTree",
options=["-B", "10", "-E", "-3", "-S", "1"])
CostS_cls = SingleClassifierEnhancer(classname="weka.classifiers.meta.CostSensitiveClassifier",
options =["-cost-matrix", "[0.0 1.0; 1.0 0.0]", "-S", "1"])
CostS_cls.classifier = base
smote = Filter(classname="weka.filters.supervised.instance.SMOTE",
options=["-C", "0", "-K", "3", "-P", "250.0", "-S", "1"])
fc = FilteredClassifier(options=["-S","1"])
fc.filter = smote
fc.classifier = CostS_cls
bagging_cls = SingleClassifierEnhancer(classname="weka.classifiers.meta.Bagging",
options=["-P", "100", "-S", "1", "-num-slots", "1", "-I", "100"])
bagging_cls.classifier = fc
multisearch_cls = MultiSearch(options = ["-S", "1"])
multisearch_cls.evaluation = "FM"
multisearch_cls.search = ["-sample-size", "100", "-initial-folds", "2", "-subsequent-folds", "10",
"-initial-test-set", ".", "-subsequent-test-set", ".", "-num-slots", "1"]
mparam = MathParameter()
mparam.prop = "numOfBoostingIterations"
mparam.minimum = 5.0
mparam.maximum = 50.0
mparam.step = 1.0
mparam.base = 10.0
mparam.expression = "I"
multisearch_cls.parameters = [mparam]
multisearch_cls.classifier = bagging_cls
AttS_cls = AttributeSelectedClassifier()
AttS_cls.search = from_commandline('weka.attributeSelection.GreedyStepwise -B -T -1.7976931348623157E308 -N -1 -num-slots 1', classname=get_classname(ASSearch))
AttS_cls.evaluation = from_commandline('weka.attributeSelection.CfsSubsetEval -P 1 -E 1', classname=get_classname(ASEvaluation))
AttS_cls.classifier = multisearch_cls
train, test = data_modelos_1_2.train_test_split(70.0, Random(1))
AttS_cls.build_classifier(train)
I am trying to validate it with cross-validation, but when I do this:
train, test = data_modelos_1_2.train_test_split(70.0, Random(1))
AttS_cls.build_classifier(train)
evl = Evaluation(test)
evl.crossvalidate_model(AttS_cls, test, 10, Random(1))
I get this error:
---------------------------------------------------------------------------
JavaException Traceback (most recent call last)
/tmp/ipykernel_50548/1197040560.py in <module>
47 print(AttS_cls.to_commandline())
48 evl = Evaluation(test)
---> 49 evl.crossvalidate_model(AttS_cls, test, 10, Random(1))
50 print(AttS_cls)
51 print("----------------------------------------------------------------------------")
/usr/local/lib/python3.8/dist-packages/weka/classifiers.py in crossvalidate_model(self, classifier, data, num_folds, rnd, output)
1289 else:
1290 generator = [output.jobject]
-> 1291 javabridge.call(
1292 self.jobject, "crossValidateModel",
1293 "(Lweka/classifiers/Classifier;Lweka/core/Instances;ILjava/util/Random;[Ljava/lang/Object;)V",
~/.local/lib/python3.8/site-packages/javabridge/jutil.py in call(o, method_name, sig, *args)
890 ret_sig = sig[sig.find(')')+1:]
891 nice_args = get_nice_args(args, args_sig)
--> 892 result = fn(*nice_args)
893 x = env.exception_occurred()
894 if x is not None:
~/.local/lib/python3.8/site-packages/javabridge/jutil.py in fn(*args)
857 x = env.exception_occurred()
858 if x is not None:
--> 859 raise JavaException(x)
860 return result
861 else:
JavaException: Thread-based execution of evaluation tasks failed!
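So I don't know what I am doing wrong. I know that this type of model can be cross-validated in Weka, but I am trying to do it with pyweka and ran into this problem.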
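I turned your code snippet into one with imports and fixed the MultiSearch setup for Bagging (mparam.prop = "numIterations" instead of mparam.prop = "numOfBoostingIterations") so that it executes.
Since I don't have access to your data, I simply used the UCI dataset vote.arff.
Your code was a bit odd: it does a 70/30 train/test split, trains the classifier, and then performs cross-validation on the test data. For cross-validation you do not train the classifier yourself, because training happens inside the internal cross-validation loop (every classifier trained inside that loop is discarded, since cross-validation is only used for gathering statistics).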
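To illustrate that last point, here is a minimal pure-Python sketch of what a k-fold cross-validation loop does. It is not Weka's implementation: `MajorityClassifier`, `cross_validate`, and the toy labels are hypothetical stand-ins, chosen only to show that each fold builds a fresh model, that the model is thrown away afterwards, and that only the counts survive.

```python
import random
from collections import Counter


class MajorityClassifier:
    """Toy stand-in for a real learner: predicts the most common training label."""

    def build(self, labels):
        self.prediction = Counter(labels).most_common(1)[0][0]

    def classify(self):
        return self.prediction


def cross_validate(labels, num_folds, seed):
    """Return (correct, total) accumulated over all folds."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    correct = 0
    for fold in range(num_folds):
        # Every num_folds-th shuffled index forms the held-out part of this fold.
        test_idx = indices[fold::num_folds]
        held_out = set(test_idx)
        train_labels = [labels[i] for i in indices if i not in held_out]
        # A fresh, untrained model is built for each fold ...
        model = MajorityClassifier()
        model.build(train_labels)
        correct += sum(1 for i in test_idx if labels[i] == model.classify())
        # ... and is discarded here; only the statistics are kept.
    return correct, len(labels)


# Hypothetical toy data: 60 "yes" labels and 40 "no" labels.
labels = ["yes"] * 60 + ["no"] * 40
correct, total = cross_validate(labels, num_folds=10, seed=1)
print(f"{correct}/{total} correct")
```

Nothing is trained before the loop, and no trained model comes out of it, which is why calling build_classifier before crossvalidate_model serves no purpose.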
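The code below is therefore split into three parts:
- your original evaluation code, commented out
- performing cross-validation properly
- performing a train/test evaluation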
I don't use Jupyter notebooks and tested the code successfully in a regular virtual environment on my Linux Mint:
- Python: 3.8.10
- Output of pip freeze:
numpy==1.22.3
packaging==21.3
pyparsing==3.0.7
python-javabridge==4.0.3
python-weka-wrapper3==0.2.7
The modified code itself:
import weka.core.jvm as jvm
from weka.core.converters import load_any_file
from weka.classifiers import Classifier, SingleClassifierEnhancer, FilteredClassifier, MultiSearch, AttributeSelectedClassifier, Evaluation
from weka.core.classes import MathParameter, from_commandline, Random, get_classname
from weka.filters import Filter
from weka.attribute_selection import ASEvaluation, ASSearch
jvm.start(packages=True)
# the dataset/path needs adjusting
data_modelos_1_2 = load_any_file("/some/where/vote.arff")
data_modelos_1_2.class_is_last()
base = Classifier(classname="weka.classifiers.trees.ADTree",
options=["-B", "10", "-E", "-3", "-S", "1"])
CostS_cls = SingleClassifierEnhancer(classname="weka.classifiers.meta.CostSensitiveClassifier",
options=["-cost-matrix", "[0.0 1.0; 1.0 0.0]", "-S", "1"])
CostS_cls.classifier = base
smote = Filter(classname="weka.filters.supervised.instance.SMOTE",
options=["-C", "0", "-K", "3", "-P", "250.0", "-S", "1"])
fc = FilteredClassifier(options=["-S", "1"])
fc.filter = smote
fc.classifier = CostS_cls
bagging_cls = SingleClassifierEnhancer(classname="weka.classifiers.meta.Bagging",
options=["-P", "100", "-S", "1", "-num-slots", "1", "-I", "100"])
bagging_cls.classifier = fc
multisearch_cls = MultiSearch(options=["-S", "1"])
multisearch_cls.evaluation = "FM"
multisearch_cls.search = ["-sample-size", "100", "-initial-folds", "2", "-subsequent-folds", "10",
"-initial-test-set", ".", "-subsequent-test-set", ".", "-num-slots", "1"]
mparam = MathParameter()
mparam.prop = "numIterations"
mparam.minimum = 5.0
mparam.maximum = 50.0
mparam.step = 1.0
mparam.base = 10.0
mparam.expression = "I"
multisearch_cls.parameters = [mparam]
multisearch_cls.classifier = bagging_cls
AttS_cls = AttributeSelectedClassifier()
AttS_cls.search = from_commandline('weka.attributeSelection.GreedyStepwise -B -T -1.7976931348623157E308 -N -1 -num-slots 1', classname=get_classname(ASSearch))
AttS_cls.evaluation = from_commandline('weka.attributeSelection.CfsSubsetEval -P 1 -E 1', classname=get_classname(ASEvaluation))
AttS_cls.classifier = multisearch_cls
# original
# train, test = data_modelos_1_2.train_test_split(70.0, Random(1))
# AttS_cls.build_classifier(train)
# evl = Evaluation(test)
# evl.crossvalidate_model(AttS_cls, test, 10, Random(1))
# print(evl.summary())
# cross-validation
print("\ncross-validation\n")
evl = Evaluation(data_modelos_1_2)
evl.crossvalidate_model(AttS_cls, data_modelos_1_2, 10, Random(1))
print(evl.summary())
# train/test split
print("\ntrain/test split\n")
train, test = data_modelos_1_2.train_test_split(70.0, Random(1))
AttS_cls.build_classifier(train)
evl = Evaluation(test)
evl.test_model(AttS_cls, test)
print(evl.summary())
jvm.stop()
This generated the following output:
cross-validation
Correctly Classified Instances 416 95.6322 %
Incorrectly Classified Instances 19 4.3678 %
Kappa statistic 0.9094
Mean absolute error 0.0737
Root mean squared error 0.1778
Relative absolute error 15.5353 %
Root relative squared error 36.5084 %
Total Number of Instances 435
train/test split
Correctly Classified Instances 126 96.1832 %
Incorrectly Classified Instances 5 3.8168 %
Kappa statistic 0.9216
Mean absolute error 0.0735
Root mean squared error 0.1649
Relative absolute error 15.3354 %
Root relative squared error 33.6949 %
Total Number of Instances 131
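As a quick sanity check, the percentages in both summaries follow directly from the instance counts. The small sketch below recomputes them; the helper name `percent` is made up for illustration, and the counts are copied from the output above.

```python
def percent(part, total):
    """Share of `part` in `total` as a percentage, rounded to four decimals."""
    return round(100.0 * part / total, 4)


# Counts taken from the two summaries above.
cv_correct, cv_wrong, cv_total = 416, 19, 435
split_correct, split_wrong, split_total = 126, 5, 131

print("cross-validation:", percent(cv_correct, cv_total), percent(cv_wrong, cv_total))
print("train/test split:", percent(split_correct, split_total), percent(split_wrong, split_total))
```

Note that the cross-validation figures cover all 435 instances, because every instance is held out exactly once, whereas the train/test figures only cover the 131 instances of the 30% test portion.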