Why is AdaBoost with 1 estimator faster than a simple decision tree?
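I want to compare AdaBoost and decision trees. As a proof of principle, I set the number of estimators in AdaBoost to 1 and left the decision tree classifier at its defaults, expecting the same result as a simple decision tree.
I do indeed get the same accuracy when predicting my test labels. However, the fitting time is much lower for AdaBoost, while the testing time is a bit higher. AdaBoost seems to use the same default settings as DecisionTreeClassifier; otherwise, the accuracy would not be exactly the same.
Can anyone explain this?
Code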
from time import time
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
print("creating classifier")
clf = AdaBoostClassifier(n_estimators = 1)
clf2 = DecisionTreeClassifier()
print("starting to fit")
time0 = time()
clf.fit(features_train,labels_train) #fit adaboost
fitting_time = time() - time0
print("time for fitting adaboost was", fitting_time)
time0 = time()
clf2.fit(features_train,labels_train) #fit dtree
fitting_time = time() - time0
print("time for fitting dtree was", fitting_time)
time1 = time()
pred = clf.predict(features_test) #test adaboost
test_time = time() - time1
print("time for testing adaboost was", test_time)
time1 = time()
pred = clf2.predict(features_test) #test dtree
test_time = time() - time1
print("time for testing dtree was", test_time)
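# note: pred was overwritten by clf2.predict above, so this scores the dtree predictions, not adaboost's
accuracy_ada = accuracy_score(pred, labels_test) #acc ada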
print("accuracy for adaboost is", accuracy_ada)
accuracy_dt = accuracy_score(pred, labels_test) #acc dtree
print("accuracy for dtree is", accuracy_dt)
Output
('time for fitting adaboost was', 3.8290421962738037)
('time for fitting dtree was', 85.19442415237427)
('time for testing adaboost was', 0.1834099292755127)
('time for testing dtree was', 0.056527137756347656)
('accuracy for adaboost is', 0.99089874857792948)
('accuracy for dtree is', 0.99089874857792948)
I tried to repeat your experiment in IPython, but I don't see such a big difference:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np
x = np.random.randn(3785,16000)
y = (x[:,0]>0.).astype(float)
clf = AdaBoostClassifier(n_estimators = 1)
clf2 = DecisionTreeClassifier()
%timeit clf.fit(x,y)
1 loop, best of 3: 5.56 s per loop
%timeit clf2.fit(x,y)
1 loop, best of 3: 5.51 s per loop
Try using a profiler, or repeat the experiment first.
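One way to follow the profiler suggestion is Python's built-in cProfile module. The sketch below is only an illustration: the array sizes are much smaller than in the experiment above so it finishes quickly, and the random data is a stand-in rather than the asker's dataset.

```python
import cProfile
import io
import pstats

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Small random stand-in data (assumed sizes, chosen only to keep the run short).
rng = np.random.default_rng(0)
x = rng.standard_normal((300, 50))
y = (x[:, 0] > 0.0).astype(int)

clf2 = DecisionTreeClassifier()

# Record every function call made while fitting the tree.
profiler = cProfile.Profile()
profiler.enable()
clf2.fit(x, y)
profiler.disable()

# Collect the ten entries with the largest cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Running the same lines with the AdaBoost classifier in place of clf2 gives a second report to compare against.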
The two classifiers you define in the following lines:
clf = AdaBoostClassifier(n_estimators = 1)
clf2 = DecisionTreeClassifier()
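actually define very different classifiers. In the first case (clf), you are defining a single (n_estimators = 1) decision tree with max_depth=1. This is explained in the documentation: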
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
where it says:
"the base estimator is DecisionTreeClassifier(max_depth=1)"
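In the second case (clf2), you are defining a decision tree whose max_depth is not fixed in advance: the tree keeps growing until all leaves are pure, so it is usually much deeper and takes much longer to fit. Again, this is described in the DecisionTreeClassifier documentation.
The moral of the story: read the documentation!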
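The difference described above can be checked directly by inspecting the fitted models. This is a minimal sketch on synthetic data from make_classification, which is an assumption standing in for the asker's features_train and labels_train. Unlike the question's code, it keeps the two sets of predictions in separate variables (pred_ada and pred_tree, names chosen here for illustration) so each accuracy is computed from the right model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the question's data (not the original dataset).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

ada = AdaBoostClassifier(n_estimators=1, random_state=0).fit(X_train, y_train)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Depth actually reached by the single tree inside AdaBoost
# and by the unrestricted decision tree.
stump_depth = ada.estimators_[0].tree_.max_depth
full_depth = tree.tree_.max_depth
print("depth of AdaBoost's only tree:", stump_depth)
print("depth of the default decision tree:", full_depth)

# Separate variables, so neither prediction overwrites the other.
pred_ada = ada.predict(X_test)
pred_tree = tree.predict(X_test)
acc_ada = accuracy_score(y_test, pred_ada)
acc_tree = accuracy_score(y_test, pred_tree)
print("accuracy for adaboost is", acc_ada)
print("accuracy for dtree is", acc_tree)
```

For a like-for-like timing comparison, one option is to construct the plain tree as DecisionTreeClassifier(max_depth=1), so both models fit a single depth-1 tree.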