How to check feature importances on text features?
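I am comparing classifiers for sentiment analysis, and I would like to know how important each feature is for each classifier. I have already tried model.feature_importances_, but because I vectorized my text data, I cannot tell which feature each importance value refers to.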
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
line = pd.read_csv('line_label.csv', encoding = "ISO-8859-1")
x = line.Berita
y = line.Sentimen
xcv = x
xtf = x
# lowercase must be a boolean; recent scikit-learn versions reject None
countvect = CountVectorizer(analyzer="word", tokenizer=None, lowercase=False)
xcv = countvect.fit_transform(x).toarray()
X_train, X_test, y_train, y_test = train_test_split(xcv, y, test_size=0.01, random_state=42)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf.score(X_test, y_test)
rf.feature_importances_
This displays:
array([2.20854745e-04, 1.24760561e-04, 3.14268988e-03, ...,
1.71782391e-04, 5.15755286e-05, 2.13065348e-08])
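Use the code below. The vectorizer's feature names are in the same column order as the importances, so zipping them pairs each word with its score:
# get_feature_names_out() needs scikit-learn >= 1.0; on older versions use get_feature_names()
for feature, importance in zip(countvect.get_feature_names_out(), rf.feature_importances_):
    print('{}: {}'.format(feature, importance))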