Recalculating feature importance after removing a feature

I classified the iris dataset with a random forest and generated feature importance scores. I then removed the least relevant feature and ran the adjusted data through the RF algorithm again. I want to recalculate the feature importance scores, but the code I'm using still expects 4 features, because it uses the original iris dataset for the index rather than the new pandas DataFrame with only 3 features that I trained the model on. How can I fix my code so I no longer get this error:

Traceback (most recent call last):
  File "/Users/userPycharmProjects/Iris_classifier_RF/feature_import_reclassify.py", line 61, in <module>
    feature_imp = pd.Series(clfr.feature_importances_,index=iris.feature_names).sort_values(ascending=False)
  File "/Users/user/.conda/envs/GST/lib/python3.8/site-packages/pandas/core/series.py", line 350, in __init__
    raise ValueError(
ValueError: Length of passed values is 3, index implies 4.
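The error can be reproduced in isolation (this is a minimal sketch, not my actual code): pandas refuses to build a Series whenever the index has more names than there are values. The exact message wording varies between pandas versions.

```python
import pandas as pd

values = [0.1, 0.4, 0.5]           # 3 importance scores
labels = ['a', 'b', 'c', 'd']      # 4 names -> length mismatch
try:
    pd.Series(values, index=labels)
except ValueError as e:
    # e.g. "Length of passed values is 3, index implies 4."
    print(e)
```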

The full code is below:

# Importing required libraries
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_iris
import sklearn.metrics as metrics


# Loading datasets
iris = load_iris()

# Convert to a pandas DataFrame, dropping sepal width (column 1),
# the least important feature
iris_data = pd.DataFrame({
    'sepal length':iris.data[:,0],
    'petal length':iris.data[:,2],
    'petal width':iris.data[:,3],
    'species':iris.target
})
iris_data.head()

# printing categories (setosa, versicolor, virginica)
print(iris.target_names)
# print flower features
print(iris.feature_names)

# setting independent (X) and dependent (Y) variables
X = iris_data[['sepal length', 'petal length', 'petal width']]  # Features
Y = iris_data['species']  # Labels


# printing feature data
print(X[0:3])
# printing dependent variable values (0 = setosa, 1 = versicolor, 2 = virginica)
print(Y)

# splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 100)

# defining random forest classifier
clfr = RandomForestClassifier(random_state = 100)
clfr.fit(X_train, y_train)

# making prediction
Y_pred = clfr.predict(X_test)

# checking model accuracy
print("Accuracy:", metrics.accuracy_score(y_test, Y_pred))
cm = np.array(confusion_matrix(y_test, Y_pred))
print(cm)

# making a prediction on new data (sepal length, petal length, petal width)
species_id = clfr.predict([[5.1, 1.4, 0.2]])
print(iris.target_names[species_id])

# determining feature importance (e.g. model participation)
feature_imp = pd.Series(clfr.feature_importances_,index=iris.feature_names).sort_values(ascending=False)
print(feature_imp)

import matplotlib.pyplot as plt
import seaborn as sns

# Creating a bar plot to visualize feature participation in model
sns.barplot(x=feature_imp, y=feature_imp.index)

# use '%matplotlib inline' to plot inline in jupyter notebooks
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

The model was trained on X, which contains only a subset of the iris features, but feature_imp still references index=iris.feature_names. Use index=X.columns instead:

feature_imp = pd.Series(clfr.feature_importances_, index=X.columns).sort_values(ascending=False)
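Because X.columns always has exactly as many names as the model has features, the Series length and index length can never disagree again. A minimal sketch with hypothetical importance scores standing in for clfr.feature_importances_:

```python
import numpy as np
import pandas as pd

# hypothetical importance scores for the 3 remaining features
importances = np.array([0.11, 0.42, 0.47])
# stands in for X.columns after sepal width was dropped
columns = ['sepal length', 'petal length', 'petal width']

# index and values now have the same length (3), so no ValueError
feature_imp = pd.Series(importances, index=columns).sort_values(ascending=False)
print(feature_imp)
```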