How to get the highest accuracy with low number of selected features using xgboost?
I have been looking at several feature selection methods and found feature selection with the help of XGBoost from the following link (XGBoost feature importance and selection). I implemented the method for my case, and the results are as follows:
- Thresh=0.000, n=11, Accuracy: 55.56%
- Thresh=0.000, n=11, Accuracy: 55.56%
- Thresh=0.000, n=11, Accuracy: 55.56%
- Thresh=0.000, n=11, Accuracy: 55.56%
- Thresh=0.097, n=7, Accuracy: 55.56%
- Thresh=0.105, n=6, Accuracy: 55.56%
- Thresh=0.110, n=5, Accuracy: 50.00%
- Thresh=0.114, n=4, Accuracy: 50.00%
- Thresh=0.169, n=3, Accuracy: 44.44%
- Thresh=0.177, n=2, Accuracy: 38.89%
- Thresh=0.228, n=1, Accuracy: 33.33%
So, my question is the following: for this case, how can I select the fewest features [n] while getting the highest accuracy? [The code can be found at the link.]
Edit 1:
Thanks to @Mihai Petre, I managed to get it working with the code in his answer. I have another question: say I run the code from the link and get the following:
Feature Importance results = [29.205832 5.0182242 0. 0. 0. 6.7736177 16.704327 18.75632 9.529003 14.012676 0. ]
Features = [ 0 7 6 9 8 5 1 10 4 3 2]
- Thresh=0.000, n=11, Accuracy: 38.89%
- Thresh=0.000, n=11, Accuracy: 38.89%
- Thresh=0.000, n=11, Accuracy: 38.89%
- Thresh=0.000, n=11, Accuracy: 38.89%
- Thresh=0.050, n=7, Accuracy: 38.89%
- Thresh=0.068, n=6, Accuracy: 38.89%
- Thresh=0.095, n=5, Accuracy: 33.33%
- Thresh=0.140, n=4, Accuracy: 38.89%
- Thresh=0.167, n=3, Accuracy: 33.33%
- Thresh=0.188, n=2, Accuracy: 38.89%
- Thresh=0.292, n=1, Accuracy: 38.89%
How can I remove the features that have zero feature importance and keep only the features that have an importance value?
Side questions:
- I am trying to find the best feature selection approach that involves a specific classification model and the best features that help give high accuracy, for example, using a KNN classifier and wanting to find the best features that give high accuracy. Which feature selection method is suitable to use?
- When implementing multiple classification models, is it better to do feature selection for each classification model, or should feature selection be done once and the selected features then be used across the multiple classification models?
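On the first side question: since KNN exposes no built-in feature importances, wrapper methods that score feature subsets with the target classifier itself are a natural fit. A minimal sketch using scikit-learn's SequentialFeatureSelector with a KNN classifier; the synthetic dataset and all parameter values here are illustrative assumptions, not taken from the question:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Toy data standing in for the real problem.
X, y = make_classification(n_samples=80, n_features=8, n_informative=3,
                           random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
# Forward selection: greedily add the feature that most improves CV accuracy.
sfs = SequentialFeatureSelector(knn, n_features_to_select=3, cv=3)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the 3 chosen columns
```

Because the selection is driven by the classifier's own cross-validated accuracy, repeating it per model (the second side question) generally gives each model its best-fitting subset, at the cost of extra compute.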
OK, so what the person in your link is doing with
from numpy import sort
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
is creating a sorted array of thresholds and then training XGBoost for every element of the thresholds array.
From your question, I think you just want to select the 6th case, the one with the lowest number of features and the highest accuracy. For that case, you'd want to do something like this:
selection = SelectFromModel(model, threshold=thresholds[5], prefit=True)
select_X_train = selection.transform(X_train)
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
select_X_test = selection.transform(X_test)
predictions = selection_model.predict(select_X_test)
accuracy = accuracy_score(y_test, predictions)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresholds[5], select_X_train.shape[1], accuracy*100.0))
If you want to automate the whole thing, then you'd want to compute, inside the for loop, the minimum n at which the accuracy reaches its maximum, and it would look more or less like this:
n_min = X_train.shape[1]  # start from your maximum number of used features
acc_max = 0
thresholds = sort(model.feature_importances_)
obj_thresh = thresholds[0]
for thresh in thresholds:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    if (select_X_train.shape[1] < n_min) and (accuracy > acc_max):
        n_min = select_X_train.shape[1]
        acc_max = accuracy
        obj_thresh = thresh
# refit with the threshold that gave the fewest features at the best accuracy
selection = SelectFromModel(model, threshold=obj_thresh, prefit=True)
select_X_train = selection.transform(X_train)
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
select_X_test = selection.transform(X_test)
predictions = selection_model.predict(select_X_test)
accuracy = accuracy_score(y_test, predictions)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (obj_thresh, select_X_train.shape[1], accuracy*100.0))
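The same selection rule can also be applied after the loop instead of inside it, by recording each (n, accuracy) pair and then picking the smallest n among the runs that tie for the best accuracy. A sketch using the result numbers reported in the question:

```python
# (n_features, accuracy) pairs as printed in the question's first run.
results = [(11, 55.56), (7, 55.56), (6, 55.56), (5, 50.00),
           (4, 50.00), (3, 44.44), (2, 38.89), (1, 33.33)]

best_acc = max(acc for _, acc in results)
# Smallest feature count among the runs that reach the best accuracy.
n_best = min(n for n, acc in results if acc == best_acc)
print(n_best)  # 6
```

Decoupling the search over (n, accuracy) from the training loop makes it easy to change the tie-breaking rule later without retraining anything.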
I managed to solve it. Please find the code below:
Getting the lowest number of features with the highest accuracy:
# Fit the model:
f_max = 8
f_min = 2
acc_max = accuracy  # baseline accuracy from the full model fitted above
n_min = f_max  # fallback in case no run satisfies the constraints
thresholds = np.sort(model_FS.feature_importances_)
obj_thresh = thresholds[0]
accuracy_list = []
for thresh in thresholds:
    # select features using threshold:
    selection = SelectFromModel(model_FS, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model:
    selection_model = xgb.XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model:
    select_X_test = selection.transform(X_test)
    selection_model_pred = selection_model.predict(select_X_test)
    selection_predictions = [round(value) for value in selection_model_pred]
    accuracy = accuracy_score(y_true=y_test, y_pred=selection_predictions) * 100
    print('Thresh= %.3f, n= %d, Accuracy: %.2f%%' % (thresh, select_X_train.shape[1], accuracy))
    accuracy_list.append(accuracy)
    if (select_X_train.shape[1] < f_max) and (select_X_train.shape[1] >= f_min) and (accuracy >= acc_max):
        n_min = select_X_train.shape[1]
        acc_max = accuracy
        obj_thresh = thresh
# select features using the chosen threshold:
selection = SelectFromModel(model_FS, threshold=obj_thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model:
selection_model = xgb.XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model:
select_X_test = selection.transform(X_test)
selection_model_pred = selection_model.predict(select_X_test)
selection_predictions = [round(value) for value in selection_model_pred]
accuracy = accuracy_score(y_true=y_test, y_pred=selection_predictions)
print("Selected: Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (obj_thresh, select_X_train.shape[1], accuracy*100.0))
key_list = list(range(X_train.shape[1], 0, -1))
accuracy_dict = dict(zip(key_list, accuracy_list))
optimum_num_feat = n_min
print(optimum_num_feat)
# Keeping only the selected columns:
X_train = X_train.loc[:, selection.get_support()]
X_test = X_test.loc[:, selection.get_support()]
print('X Train FI: ')
print(X_train)
print('X Test FI: ')
print(X_test)
Getting the features whose importance value is not zero:
# Calculate feature importances:
importances = model_FS.feature_importances_
print(importances * 100)
# Organising the feature importances in a dictionary, one key per feature index:
key_list = range(len(importances))
feature_importance_dict = dict(zip(key_list, importances))
sort_feature_importance_dict = dict(sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True))
print('Feature Importance Dictionary (Sorted): ', sort_feature_importance_dict)
# Removing the features that have zero feature importance:
filtered_feature_importance_dict = {x: y for x, y in sort_feature_importance_dict.items() if y != 0}
print('Filtered Feature Importance Dictionary: ', filtered_feature_importance_dict)
f_indices = np.asarray(list(filtered_feature_importance_dict.keys()))
print(f_indices)
X_train = X_train.iloc[:, f_indices]
X_test = X_test.iloc[:, f_indices]
print('X Train FI: ')
print(X_train)
print('X Test FI: ')
print(X_test)
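The same filtering can be done more compactly on the importance array alone, without the dictionary round-trip. A minimal sketch, assuming the importance scores reported in Edit 1 (the array below is data copied from the question, not a fitted model):

```python
import numpy as np

# Importance scores as printed in the question; zeros mark unused features.
importances = np.array([29.205832, 5.0182242, 0., 0., 0., 6.7736177,
                        16.704327, 18.75632, 9.529003, 14.012676, 0.])

# Sort feature indices by importance, most important first...
order = np.argsort(importances)[::-1]
# ...then keep only the indices whose importance is non-zero.
nonzero_idx = order[importances[order] > 0]
print(nonzero_idx)  # [0 7 6 9 8 5 1]
```

`X_train.iloc[:, nonzero_idx]` would then keep just those columns, matching the dictionary-based result above.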