How to get a graph with the k most important features from my model?
Hello, I am using a random forest over several matrices, and I would like to get the k best features of my model, meaning the 3, 4, or k most relevant features. I tried the following. The problem with this approach is that I get a plot of all my features, and since I compute a lot of them the result is not as interpretable as I would like. So I would appreciate help modifying the code below so that it plots only a fixed number of features, which I would like to set as a parameter:
import random
import numpy as np
import matplotlib.pyplot as plt
train_matrix = np.concatenate([state_matrix,company_matrix,seg,complete_work,sub_rep,b_tec,time1,time2,time3,time4,time5,len1], axis=1)
#Performing a shuffle of my data
index_list = list(range(train_matrix.shape[0]))
random.shuffle(index_list)
train_matrix = train_matrix[index_list]
labels_list = labels_list[index_list]
print('times shape: ', time_matrix.shape)
print('cities shape: ', cities.shape)
print('labels1 shape: ', labels1.shape)
print('state shape: ', state_matrix.shape)
print('work type shape: ', work_type.shape)
print('train matrix shape', train_matrix.shape)
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    train_matrix, labels_list.tolist(), test_size=0.1, random_state=47)
clf2 = RandomForestClassifier(n_estimators=100, n_jobs=4)
print("vectorization completed")
print("begining training")
import timeit
start_time = timeit.default_timer()
clf2 = clf2.fit(X_train, y_train)
elapsed = timeit.default_timer() - start_time
print('Matrix time shape: '+str(train_matrix.shape)+' Time Seconds: ',elapsed)
#with open('random_forest.pickle','wb') as idxf:
# pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)
print("finishing training")
y_pred = clf2.predict(X_test)
This is the part I would like to modify so that I get only the k best features of the model:
importances = clf2.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf2.estimators_],
axis=0)
indices = np.argsort(importances)[::-1]
# Print the feature ranking
print("Feature ranking:")
for f in range(X_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
#Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
color="r", yerr=std[indices], align="center")
plt.xticks(range(X_train.shape[1]), indices)
plt.xlim([-1, X_train.shape[1]])
plt.savefig('fig1.png', dpi = 600)
plt.show()
Here is another part of the code:
print("PREDICTION REPORT")
# importing Confusion Matrix and recall
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
print(precision_recall_fscore_support(y_test, y_pred, average='macro'))
print(confusion_matrix(y_test, y_pred))
# to print unique values
print(set(y_test))
print(set(y_pred))
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
Output:
Feature ranking:
1. feature 660 (0.403711)
2. feature 655 (0.139531)
3. feature 659 (0.058074)
4. feature 658 (0.057855)
5. feature 321 (0.015031)
6. feature 322 (0.012731)
7. feature 324 (0.011937)
8. feature 336 (0.011728)
9. feature 650 (0.011174)
10. feature 656 (0.010441)
11. feature 657 (0.009340)
12. feature 337 (0.007385)
13. feature 509 (0.005184)
14. feature 330 (0.005056)
15. feature 325 (0.004927)
16. feature 344 (0.004891)
17. feature 326 (0.004495)
18. feature 334 (0.004349)
19. feature 333 (0.004291)
20. feature 352 (0.004284)
21. feature 338 (0.004164)
22. feature 285 (0.003909)
23. feature 345 (0.003631)
24. feature 652 (0.003341)
25. feature 329 (0.003168)
26. feature 651 (0.002890)
27. feature 388 (0.002680)
28. feature 146 (0.002650)
29. feature 332 (0.002482)
30. feature 217 (0.002475)
31. feature 513 (0.002363)
32. feature 216 (0.002309)
33. feature 116 (0.002223)
34. feature 323 (0.002107)
35. feature 213 (0.002104)
36. feature 328 (0.002101)
37. feature 102 (0.002088)
38. feature 315 (0.002083)
39. feature 307 (0.002079)
40. feature 427 (0.002043)
41. feature 351 (0.001925)
42. feature 259 (0.001888)
43. feature 171 (0.001878)
44. feature 243 (0.001863)
45. feature 78 (0.001862)
46. feature 490 (0.001815)
47. feature 339 (0.001770)
48. feature 103 (0.001767)
49. feature 591 (0.001741)
50. feature 55 (0.001734)
51. feature 502 (0.001665)
52. feature 194 (0.001632)
53. feature 491 (0.001625)
54. feature 50 (0.001591)
55. feature 193 (0.001590)
56. feature 97 (0.001549)
57. feature 510 (0.001514)
58. feature 245 (0.001504)
59. feature 434 (0.001497)
60. feature 8 (0.001468)
61. feature 241 (0.001457)
62. feature 108 (0.001454)
63. feature 232 (0.001453)
64. feature 292 (0.001443)
65. feature 96 (0.001434)
66. feature 99 (0.001381)
67. feature 11 (0.001367)
68. feature 106 (0.001360)
69. feature 592 (0.001335)
70. feature 60 (0.001334)
71. feature 523 (0.001327)
72. feature 72 (0.001324)
73. feature 236 (0.001323)
74. feature 128 (0.001320)
75. feature 144 (0.001318)
76. feature 288 (0.001300)
77. feature 238 (0.001292)
78. feature 654 (0.001287)
79. feature 499 (0.001285)
80. feature 223 (0.001283)
81. feature 593 (0.001275)
82. feature 33 (0.001264)
83. feature 289 (0.001240)
84. feature 94 (0.001236)
85. feature 433 (0.001233)
86. feature 129 (0.001227)
87. feature 437 (0.001226)
88. feature 113 (0.001221)
89. feature 54 (0.001220)
90. feature 271 (0.001213)
91. feature 107 (0.001186)
92. feature 562 (0.001165)
93. feature 488 (0.001144)
94. feature 521 (0.001128)
95. feature 269 (0.001110)
96. feature 313 (0.001102)
97. feature 13 (0.001063)
98. feature 59 (0.001059)
99. feature 529 (0.001059)
100. feature 278 (0.001055)
101. feature 68 (0.001053)
102. feature 189 (0.001038)
103. feature 176 (0.001001)
104. feature 367 (0.001000)
105. feature 32 (0.001000)
106. feature 18 (0.000984)
107. feature 135 (0.000957)
108. feature 127 (0.000933)
109. feature 39 (0.000924)
110. feature 391 (0.000921)
111. feature 156 (0.000919)
112. feature 316 (0.000904)
113. feature 389 (0.000895)
114. feature 522 (0.000885)
115. feature 449 (0.000874)
116. feature 4 (0.000872)
117. feature 258 (0.000840)
118. feature 489 (0.000828)
119. feature 347 (0.000823)
120. feature 264 (0.000790)
After receiving feedback here, I tried this:
importances = clf2.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf2.estimators_],
axis=0)
indices = np.argsort(importances)[::-1]
top_k = 10
new_indices = indices[:top_k]
#So you just need to change this part accordingly (just change top_k to your desired value):
# Print the feature ranking
print("Feature ranking:")
for f in range(top_k):
    print("%d. feature %d (%f)" % (f + 1, new_indices[f], importances[new_indices[f]]))
#Same here for plotting the graph:
#Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(top_k), importances[new_indices],
color="r", yerr=std[new_indices], align="center")
plt.xticks(range(new_indices), new_indices)
plt.xlim([-1, new_indices])
plt.savefig('fig1.png', dpi = 600)
plt.show()
But I got the following error, so I would really appreciate help getting past this point.
Feature ranking:
1. feature 660 (0.405876)
2. feature 655 (0.138400)
3. feature 659 (0.056848)
4. feature 658 (0.056631)
5. feature 321 (0.014537)
6. feature 336 (0.013202)
7. feature 324 (0.012455)
8. feature 322 (0.011517)
9. feature 656 (0.011493)
10. feature 650 (0.010850)
Traceback (most recent call last):
File "random_forest.py", line 234, in <module>
plt.xticks(range(new_indices), new_indices)
TypeError: only integer scalar arrays can be converted to a scalar index
This is where the indices of the important features are sorted in descending order, which means that indices[:10] gives you the 10 most important features:
indices = np.argsort(importances)[::-1]
top_k = 10
new_indices = indices[:top_k]
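For intuition, here is a tiny self-contained example (the values are made up, not taken from the model above) of what argsort plus slicing returns:

import numpy as np

# Toy importances, purely illustrative
importances = np.array([0.1, 0.4, 0.05, 0.25, 0.2])

# argsort sorts ascending; [::-1] flips it to descending
indices = np.argsort(importances)[::-1]  # array([1, 3, 4, 0, 2])

top_k = 3
print(indices[:top_k])  # [1 3 4] -> the 3 most important feature indices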
So you just need to change this part accordingly (just change top_k to your desired value):
# Print the feature ranking
print("Feature ranking:")
for f in range(top_k):
    print("%d. feature %d (%f)" % (f + 1, new_indices[f], importances[new_indices[f]]))
The same goes for plotting the graph. The TypeError you got comes from passing the NumPy array new_indices to range() and plt.xlim(), which both expect an integer; use top_k there instead:
#Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(top_k), importances[new_indices],
color="r", yerr=std[new_indices], align="center")
#Edited here (put top_k in range)
plt.xticks(range(top_k), new_indices)
#Edited here (put top_k)
plt.xlim([-1, top_k])
plt.show()
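If you want k to be a real parameter, as the question asks, the whole block can be wrapped in a small helper. Here is a minimal sketch along those lines, assuming a fitted RandomForestClassifier; the function name plot_top_k_importances is my own, not from the code above:

import numpy as np
import matplotlib.pyplot as plt

def plot_top_k_importances(clf, top_k=10, out_file=None):
    """Plot the top_k feature importances of a fitted random forest."""
    importances = clf.feature_importances_
    # Spread of the importances across the individual trees, used as error bars
    std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
    # Indices of the top_k features, most important first
    top = np.argsort(importances)[::-1][:top_k]

    plt.figure()
    plt.title("Top %d feature importances" % top_k)
    plt.bar(range(top_k), importances[top],
            color="r", yerr=std[top], align="center")
    plt.xticks(range(top_k), top)
    plt.xlim([-1, top_k])
    if out_file is not None:
        plt.savefig(out_file, dpi=600)
    plt.show()

# Usage, e.g.: plot_top_k_importances(clf2, top_k=10, out_file='fig1.png')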