Store most informative features from NLTK NaiveBayesClassifier in a list

I am trying out this Naive Bayes classifier in Python:

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Naive Bayes Accuracy " + str(nltk.classify.accuracy(classifier, test_set) * 100))
classifier.show_most_informative_features(5)

I get the following output:

[screenshot: console output of show_most_informative_features]

It is clearly visible which words appear more in the "important" category and which appear more in the "spam" category. But I can't work with these values. I actually want a list that looks like this:

[[pass,important],[respective,spam],[investment,spam],[internet,spam],[understands,spam]]

I'm new to Python and am having a hard time figuring all of this out. Can someone help? I would be grateful.

You can slightly modify the source code of show_most_informative_features to suit your purpose.

The first element of each sublist corresponds to the most informative feature name, while the second element corresponds to its label (more specifically, the label associated with the numerator term of the ratio).
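For example, given a printed line like

outstanding = True              pos : neg    =     13.9 : 1.0

(the numbers here are only illustrative), pos is the numerator label, so the helper below would pair 'outstanding' with 'pos'.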

辅助函数:

def show_most_informative_features_in_list(classifier, n=10):
    """
    Return a nested list of the "most informative" features
    used by the classifier along with its predominant labels
    """
    cpdist = classifier._feature_probdist       # probability distribution for feature values given labels
    feature_list = []
    for (fname, fval) in classifier.most_informative_features(n):
        def labelprob(l):
            return cpdist[l, fname].prob(fval)
        labels = sorted([l for l in classifier._labels if fval in cpdist[l, fname].samples()], 
                        key=labelprob)
        feature_list.append([fname, labels[-1]])
    return feature_list

Testing this on a classifier trained over nltk's positive/negative movie review corpus:
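(If you need a classifier to reproduce this, a minimal training sketch along the lines of the NLTK book's movie review example should work; the bare-word feature names and the 2000-word cutoff are assumptions, not part of the original answer.)

import random
import nltk
from nltk.corpus import movie_reviews

# (token-list, label) pairs from the movie review corpus
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# binary word-presence features over the most frequent words (2000 is an assumed cutoff)
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}

train_set = [(document_features(d), c) for (d, c) in documents]
classifier = nltk.NaiveBayesClassifier.train(train_set)

With such a classifier in hand: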

show_most_informative_features_in_list(classifier, 10)

produces:

[['outstanding', 'pos'],
 ['ludicrous', 'neg'],
 ['avoids', 'pos'],
 ['astounding', 'pos'],
 ['idiotic', 'neg'],
 ['atrocious', 'neg'],
 ['offbeat', 'pos'],
 ['fascination', 'pos'],
 ['symbol', 'pos'],
 ['animators', 'pos']]

Just use most_informative_features()

Using the example from Classification using movie review corpus in NLTK/Python:

import string
from itertools import chain

from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc

stop = stopwords.words('english')

# (token-list, label) pairs; the label is the first path component of the fileid, i.e. 'pos' or 'neg'
documents = [([w for w in mr.words(i)
               if w.lower() not in stop and w.lower() not in string.punctuation],
              i.split('/')[0])
             for i in mr.fileids()]

# use 100 words as features
# (NB: in NLTK 3, FreqDist.keys() is not frequency-ordered; use
#  word_features.most_common(100) if you want the 100 most frequent words)
word_features = FreqDist(chain(*[i for i, j in documents]))
word_features = list(word_features.keys())[:100]

# 90/10 train/test split with binary word-presence features
numtrain = int(len(documents) * 90 / 100)
train_set = [({i: (i in tokens) for i in word_features}, tag) for tokens, tag in documents[:numtrain]]
test_set = [({i: (i in tokens) for i in word_features}, tag) for tokens, tag in documents[numtrain:]]

classifier = nbc.train(train_set)

Then, simply:

print(classifier.most_informative_features())

[output]:

[('turturro', True),
 ('inhabiting', True),
 ('taboo', True),
 ('conflicted', True),
 ('overacts', True),
 ('rescued', True),
 ('stepdaughter', True),
 ('apologizing', True),
 ('pup', True),
 ('inform', True)]

And to list all of the features:

classifier.most_informative_features(n=len(word_features))

[output]:

[('turturro', True),
 ('inhabiting', True),
 ('taboo', True),
 ('conflicted', True),
 ('overacts', True),
 ('rescued', True),
 ('stepdaughter', True),
 ('apologizing', True),
 ('pup', True),
 ('inform', True),
 ('commercially', True),
 ('utilize', True),
 ('gratuitous', True),
 ('visible', True),
 ('internet', True),
 ('disillusioned', True),
 ('boost', True),
 ('preventing', True),
 ('built', True),
 ('repairs', True),
 ('overplaying', True),
 ('election', True),
 ('caterer', True),
 ('decks', True),
 ('retiring', True),
 ('pivot', True),
 ('outwitting', True),
 ('solace', True),
 ('benches', True),
 ('terrorizes', True),
 ('billboard', True),
 ('catalogue', True),
 ('clean', True),
 ('skits', True),
 ('nice', True),
 ('feature', True),
 ('must', True),
 ('withdrawn', True),
 ('indulgence', True),
 ('tribal', True),
 ('freeman', True),
 ('must', False),
 ('nice', False),
 ('feature', False),
 ('gratuitous', False),
 ('turturro', False),
 ('built', False),
 ('internet', False),
 ('rescued', False),
 ('clean', False),
 ('overacts', False),
 ('gregor', False),
 ('conflicted', False),
 ('taboo', False),
 ('inhabiting', False),
 ('utilize', False),
 ('churns', False),
 ('boost', False),
 ('stepdaughter', False),
 ('complementary', False),
 ('gleiberman', False),
 ('skylar', False),
 ('kirkpatrick', False),
 ('hardship', False),
 ('election', False),
 ('inform', False),
 ('disillusioned', False),
 ('visible', False),
 ('commercially', False),
 ('frosted', False),
 ('pup', False),
 ('apologizing', False),
 ('freeman', False),
 ('preventing', False),
 ('nutsy', False),
 ('intrinsics', False),
 ('somalia', False),
 ('coordinators', False),
 ('strengthening', False),
 ('impatience', False),
 ('subtely', False),
 ('426', False),
 ('schreber', False),
 ('brimley', False),
 ('motherload', False),
 ('creepily', False),
 ('perturbed', False),
 ('accountants', False),
 ('beringer', False),
 ('scrubs', False),
 ('1830s', False),
 ('analogue', False),
 ('espouses', False),
 ('xv', False),
 ('skits', False),
 ('solace', False),
 ('reduncancy', False),
 ('parenthood', False),
 ('insulators', False),
 ('mccoll', False)]

To clarify:

>>> type(classifier.most_informative_features(n=len(word_features)))
list
>>> type(classifier.most_informative_features(10)[0][1])
bool

To clarify further: if the values used in the feature set are strings, most_informative_features() will return strings as well, e.g.

import string
from itertools import chain

from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc

stop = stopwords.words('english')
documents = [([w for w in mr.words(i)
               if w.lower() not in stop and w.lower() not in string.punctuation],
              i.split('/')[0])
             for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i, j in documents]))
word_features = list(word_features.keys())[:100]

# string-valued features ('positive'/'negative') instead of booleans
numtrain = int(len(documents) * 90 / 100)
train_set = [({i: 'positive' if (i in tokens) else 'negative' for i in word_features}, tag) for tokens, tag in documents[:numtrain]]
test_set = [({i: 'positive' if (i in tokens) else 'negative' for i in word_features}, tag) for tokens, tag in documents[numtrain:]]

classifier = nbc.train(train_set)

And:

>>> classifier.most_informative_features(10)
[('turturro', 'positive'),
 ('inhabiting', 'positive'),
 ('conflicted', 'positive'),
 ('taboo', 'positive'),
 ('overacts', 'positive'),
 ('rescued', 'positive'),
 ('stepdaughter', 'positive'),
 ('pup', 'positive'),
 ('apologizing', 'positive'),
 ('inform', 'positive')]

>>> type(classifier.most_informative_features(10)[0][1])
str

The most informative features for Naive Bayes (the most distinguishing or differentiating tokens) are going to be the values with the greatest difference in p(word | class) between the two classes.

You first have to do some text manipulation and tokenization so that you end up with two lists: a list A of all tokens present in all strings that were labeled as class A, and another list B of all tokens present in all strings labeled as class B. These two lists should contain repeated tokens, which we can count to build frequency distributions; a sketch of this step follows below.
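A minimal sketch of that step, assuming the labeled strings live in two plain Python lists (classA_docs and classB_docs are hypothetical names):

import nltk
from nltk.tokenize import word_tokenize  # requires the 'punkt' tokenizer data

# hypothetical example inputs: raw strings labeled class A and class B
classA_docs = ["please review the attached report", "the meeting moved to friday"]
classB_docs = ["win a free prize now", "exclusive investment opportunity inside"]

# flatten each labeled set into one list of (repeated) tokens
classAWords = [w.lower() for doc in classA_docs for w in word_tokenize(doc)]
classBWords = [w.lower() for doc in classB_docs for w in word_tokenize(doc)]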

Then run this code:

classA_freq_distribution = nltk.FreqDist(classAWords)
classB_freq_distribution = nltk.FreqDist(classBWords)
# take the 3000 most frequent tokens from each class
# (FreqDist.keys() is not frequency-ordered in NLTK 3, so use most_common)
classA_word_features = [w for w, _ in classA_freq_distribution.most_common(3000)]
classB_word_features = [w for w, _ in classB_freq_distribution.most_common(3000)]

This grabs the top 3000 features from each list, but you can pick a number other than 3000. Now that you have the frequency distributions, you can compute p(word | class) and then look at the differences between the two classes.

import pandas as pd

diff = []
features = []
for feature in classA_word_features:
    features.append(feature)
    # difference in relative frequency: p(word | class B) - p(word | class A)
    diff.append(classB_freq_distribution[feature] / len(classBWords)
                - classA_freq_distribution[feature] / len(classAWords))

all_features = pd.DataFrame({
    'Feature': features,
    'Diff': diff
})

You can then sort and look at the highest- and lowest-valued words:

sorted_features = all_features.sort_values(by=['Diff'], ascending=False)
print(sorted_features)
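If you only want the extremes, head() and tail() on the sorted DataFrame show the most class-B-like and most class-A-like words, respectively:

print(sorted_features.head(10))   # largest positive Diff: most indicative of class B
print(sorted_features.tail(10))   # most negative Diff: most indicative of class A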