Problems obtaining most informative features with scikit learn?

I'm trying to obtain the most informative features from a textual corpus. From this well-answered question I know that this task can be done as follows:

def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print(classlabel, feat, coef)

Then:

most_informative_feature_for_class(tfidf_vect, clf, 5)

For this classifier:

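# sklearn.cross_validation was removed in later releases; model_selection replaces it.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = SVC(kernel='linear', C=1)
clf.fit(X_train, y_train)  # fit on the training split only
prediction = clf.predict(X_test)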

The problem is the output of most_informative_feature_for_class:

5 a_base_de_bien bastante   (0, 2451)   -0.210683496368
  (0, 3533) -0.173621065386
  (0, 8034) -0.135543062425
  (0, 10346)    -0.173621065386
  (0, 15231)    -0.154148294738
  (0, 18261)    -0.158890483047
  (0, 21083)    -0.297476572586
  (0, 434)  -0.0596263855375
  (0, 446)  -0.0753492277856
  (0, 769)  -0.0753492277856
  (0, 1118) -0.0753492277856
  (0, 1439) -0.0753492277856
  (0, 1605) -0.0753492277856
  (0, 1755) -0.0637950312345
  (0, 3504) -0.0753492277856
  (0, 3511) -0.115802483001
  (0, 4382) -0.0668983049212
  (0, 5247) -0.315713152154
  (0, 5396) -0.0753492277856
  (0, 5753) -0.0716096348446
  (0, 6507) -0.130661516772
  (0, 7978) -0.0753492277856
  (0, 8296) -0.144739048504
  (0, 8740) -0.0753492277856
  (0, 8906) -0.0753492277856
  : :
  (0, 23282)    0.418623443832
  (0, 4100) 0.385906085143
  (0, 15735)    0.207958503155
  (0, 16620)    0.385906085143
  (0, 19974)    0.0936828782325
  (0, 20304)    0.385906085143
  (0, 21721)    0.385906085143
  (0, 22308)    0.301270427482
  (0, 14903)    0.314164150621
  (0, 16904)    0.0653764031957
  (0, 20805)    0.0597723455204
  (0, 21878)    0.403750815828
  (0, 22582)    0.0226150073272
  (0, 6532) 0.525138162099
  (0, 6670) 0.525138162099
  (0, 10341)    0.525138162099
  (0, 13627)    0.278332617058
  (0, 1600) 0.326774799211
  (0, 2074) 0.310556919237
  (0, 5262) 0.176400451433
  (0, 6373) 0.290124806858
  (0, 8593) 0.290124806858
  (0, 12002)    0.282832270298
  (0, 15008)    0.290124806858
  (0, 19207)    0.326774799211

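It returns neither the label nor the words. Why is this happening, and how can I print the words and the labels? Do you think this is happening because I am using pandas to read the data? Another thing I tried is the following, from this question: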

import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))


print_top10(tfidf_vect, clf, 5)

But I get this traceback:

Traceback (most recent call last):

  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 237, in <module>
    print_top10(tfidf_vect,clf,5)
  File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 231, in print_top10
    for i, class_label in enumerate(class_labels):
TypeError: 'int' object is not iterable

Any idea how to solve this, so that I get the features with the highest coefficient values?

To solve this specifically for a linear SVM, we first have to understand the formulation of the SVM in sklearn and how it differs from MultinomialNB.

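The reason most_informative_feature_for_class works for MultinomialNB is that coef_ is essentially the log probability of each feature given a class (and hence has size [n_classes, n_features]), due to the formulation of the naive Bayes problem. But if we check the documentation for SVM, coef_ is not that simple. Instead, coef_ for a (linear) SVM has size [n_classes * (n_classes - 1) / 2, n_features], because one binary model is fitted for every possible pair of classes.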
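To make that shape concrete, here is a small sketch of how the rows of a one-vs-one coef_ could be mapped back to class pairs. It assumes the row order described in scikit-learn's multi-class SVM documentation ("0 vs 1", "0 vs 2", ..., "1 vs 2", ...) applied to the sorted class labels. ovo_row_pairs is a hypothetical helper, not part of scikit-learn, and the four language tags are only illustrative.

```python
from itertools import combinations

def ovo_row_pairs(classes):
    """Return the (class_a, class_b) pair assumed to sit behind each coef_ row.

    Assumption: rows follow the one-vs-one order "0 vs 1", "0 vs 2", ...,
    "1 vs 2", ... over the sorted class labels.
    """
    return list(combinations(sorted(classes), 2))

pairs = ovo_row_pairs(['bs', 'pt', 'es', 'sr'])
for row, (a, b) in enumerate(pairs):
    print(row, a, 'vs', b)
```

With four classes this gives 4 * 3 / 2 = 6 rows, and under this assumption row 3 would be the es vs pt model.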

If we know which particular coefficient vector we are interested in, we can alter the function to look like the following:

def most_informative_feature_for_class_svm(vectorizer, classifier, labelid, n=10):
    # labelid is the row of coef_ we're interested in (one row per pair of classes).
    feature_names = vectorizer.get_feature_names()
    # coef_ is a sparse matrix when the SVM was fit on sparse input, so make it dense.
    svm_coef = classifier.coef_.toarray()
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print(feat, coef)

This works as intended and prints the top n features, with their coefficients, for the coefficient vector you're after.
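Because each row of a one-vs-one coef_ separates two classes, the most negative coefficients can be as informative as the most positive ones. Below is a minimal sketch that returns both ends, using plain Python lists in place of a real coefficient row. top_and_bottom and the toy numbers are made up for illustration, and which class each sign favors should be checked against your own model.

```python
import heapq

def top_and_bottom(coefs, feature_names, n=2):
    """Return the n largest and n smallest (coef, feature) pairs."""
    pairs = list(zip(coefs, feature_names))
    largest = heapq.nlargest(n, pairs)    # strongest positive weights first
    smallest = heapq.nsmallest(n, pairs)  # strongest negative weights first
    return largest, smallest

# Toy values standing in for one dense row of coef_ and its feature names.
coefs = [0.04, -0.02, 0.02, -0.05, 0.0]
names = ['lo', 'foi', 'una', 'que', 'de']

largest, smallest = top_and_bottom(coefs, names)
print(largest)
print(smallest)
```

With a real model, coefs would be svm_coef[labelid] and feature_names the vectorizer's vocabulary.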

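As for getting the correct output for a particular class, that depends on your assumptions and on what you aim to output. I suggest reading through the multi-class section of the SVM documentation to get a feel for what you're after.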

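So, using the train.txt file that was described in this question, we can get some kind of output, though in this situation it isn't particularly descriptive or helpful to interpret. Hopefully this helps you.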

# Written against an older scikit-learn. On >= 1.0 use get_feature_names_out();
# on >= 1.1 MultinomialNB exposes feature_log_prob_ instead of coef_.
import codecs

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

trainfile = 'train.txt'

# Vectorizing data: each line of train.txt is one document.
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'es', 'sr']

# Training NB
mnb = MultinomialNB()
mnb.fit(trainset, tags)

# Training a linear SVM
svcc = SVC(kernel='linear', C=1)
svcc.fit(trainset, tags)

def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print(classlabel, feat, coef)

def most_informative_feature_for_class_svm(vectorizer, classifier, labelid=3, n=10):
    # labelid=3 is the row of coef_ we're interested in here.
    feature_names = vectorizer.get_feature_names()
    svm_coef = classifier.coef_.toarray()  # coef_ is sparse for sparse input
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print(feat, coef)

most_informative_feature_for_class(word_vectorizer, mnb, 'pt')
print()
most_informative_feature_for_class_svm(word_vectorizer, svcc)

Output:

pt teve -4.63472898823
pt tive -4.63472898823
pt todas -4.63472898823
pt vida -4.63472898823
pt de -4.22926388012
pt foi -4.22926388012
pt mais -4.22926388012
pt me -4.22926388012
pt as -3.94158180767
pt que -3.94158180767

no 0.0204081632653
parecer 0.0204081632653
pone 0.0204081632653
por 0.0204081632653
relación 0.0204081632653
una 0.0204081632653
visto 0.0204081632653
ya 0.0204081632653
es 0.0408163265306
lo 0.0408163265306