Store most informative features from NLTK NaiveBayesClassifier in a list
I'm trying out this Naive Bayes classifier in Python:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print "Naive Bayes Accuracy " + str(nltk.classify.accuracy(classifier, test_set)*100)
classifier.show_most_informative_features(5)
I get the following output:
(console output screenshot)
You can clearly see which words appear more in the "important" category and which appear more in the "spam" category. But I can't work with these values. I actually want a list that looks like this:
[[pass,important],[respective,spam],[investment,spam],[internet,spam],[understands,spam]]
I'm new to Python and having a hard time figuring all of this out. Can anyone help? I'd really appreciate it.
You could slightly modify the source code of show_most_informative_features to suit your purpose.
The first element of each sub-list corresponds to the most informative feature name, while the second element corresponds to its label (more specifically, the label associated with the numerator term of the ratio).
Helper function:
def show_most_informative_features_in_list(classifier, n=10):
    """
    Return a nested list of the "most informative" features
    used by the classifier along with its predominant labels
    """
    cpdist = classifier._feature_probdist  # probability distribution for feature values given labels
    feature_list = []
    for (fname, fval) in classifier.most_informative_features(n):
        def labelprob(l):
            return cpdist[l, fname].prob(fval)
        labels = sorted([l for l in classifier._labels if fval in cpdist[l, fname].samples()],
                        key=labelprob)
        feature_list.append([fname, labels[-1]])
    return feature_list
Testing this on a classifier trained over nltk's positive/negative movie review corpus:
show_most_informative_features_in_list(classifier, 10)
produces:
[['outstanding', 'pos'],
['ludicrous', 'neg'],
['avoids', 'pos'],
['astounding', 'pos'],
['idiotic', 'neg'],
['atrocious', 'neg'],
['offbeat', 'pos'],
['fascination', 'pos'],
['symbol', 'pos'],
['animators', 'pos']]
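To see concretely what "the label associated with the numerator term of the ratio" means, here is a minimal sketch that reuses the same private NLTK attributes (_feature_probdist and _labels) the helper above relies on. It prints p(feature value | label) under each label for the single most informative feature; the label with the higher probability is the one the helper returns:
cpdist = classifier._feature_probdist
fname, fval = classifier.most_informative_features(1)[0]
for label in classifier._labels:
    # the label with the highest probability is the numerator of NLTK's ratio
    print(label, cpdist[label, fname].prob(fval))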
Simply use most_informative_features()
Using the example from Classification using movie review corpus in NLTK/Python:
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')
# each document: (tokens minus stopwords/punctuation, label taken from the fileid, e.g. 'pos'/'neg')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
# keep the first 100 vocabulary items as boolean word-presence features
word_features = FreqDist(chain(*[i for i, j in documents]))
word_features = list(word_features.keys())[:100]
# 90/10 train/test split
numtrain = int(len(documents) * 90 / 100)
train_set = [({i: (i in tokens) for i in word_features}, tag) for tokens, tag in documents[:numtrain]]
test_set = [({i: (i in tokens) for i in word_features}, tag) for tokens, tag in documents[numtrain:]]
classifier = nbc.train(train_set)
Then, simply:
print(classifier.most_informative_features())
[out]:
[('turturro', True),
('inhabiting', True),
('taboo', True),
('conflicted', True),
('overacts', True),
('rescued', True),
('stepdaughter', True),
('apologizing', True),
('pup', True),
('inform', True)]
And to list all the features:
classifier.most_informative_features(n=len(word_features))
[out]:
[('turturro', True),
('inhabiting', True),
('taboo', True),
('conflicted', True),
('overacts', True),
('rescued', True),
('stepdaughter', True),
('apologizing', True),
('pup', True),
('inform', True),
('commercially', True),
('utilize', True),
('gratuitous', True),
('visible', True),
('internet', True),
('disillusioned', True),
('boost', True),
('preventing', True),
('built', True),
('repairs', True),
('overplaying', True),
('election', True),
('caterer', True),
('decks', True),
('retiring', True),
('pivot', True),
('outwitting', True),
('solace', True),
('benches', True),
('terrorizes', True),
('billboard', True),
('catalogue', True),
('clean', True),
('skits', True),
('nice', True),
('feature', True),
('must', True),
('withdrawn', True),
('indulgence', True),
('tribal', True),
('freeman', True),
('must', False),
('nice', False),
('feature', False),
('gratuitous', False),
('turturro', False),
('built', False),
('internet', False),
('rescued', False),
('clean', False),
('overacts', False),
('gregor', False),
('conflicted', False),
('taboo', False),
('inhabiting', False),
('utilize', False),
('churns', False),
('boost', False),
('stepdaughter', False),
('complementary', False),
('gleiberman', False),
('skylar', False),
('kirkpatrick', False),
('hardship', False),
('election', False),
('inform', False),
('disillusioned', False),
('visible', False),
('commercially', False),
('frosted', False),
('pup', False),
('apologizing', False),
('freeman', False),
('preventing', False),
('nutsy', False),
('intrinsics', False),
('somalia', False),
('coordinators', False),
('strengthening', False),
('impatience', False),
('subtely', False),
('426', False),
('schreber', False),
('brimley', False),
('motherload', False),
('creepily', False),
('perturbed', False),
('accountants', False),
('beringer', False),
('scrubs', False),
('1830s', False),
('analogue', False),
('espouses', False),
('xv', False),
('skits', False),
('solace', False),
('reduncancy', False),
('parenthood', False),
('insulators', False),
('mccoll', False)]
To clarify:
>>> type(classifier.most_informative_features(n=len(word_features)))
list
>>> type(classifier.most_informative_features(10)[0][1])
bool
Furthermore, if the labels used in the feature set are strings, most_informative_features() will return string values, e.g.
import string
from itertools import chain
from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i, j in documents]))
word_features = list(word_features.keys())[:100]
numtrain = int(len(documents) * 90 / 100)
# same setup as above, but feature values are now the strings 'positive'/'negative' instead of booleans
train_set = [({i: 'positive' if (i in tokens) else 'negative' for i in word_features}, tag) for tokens, tag in documents[:numtrain]]
test_set = [({i: 'positive' if (i in tokens) else 'negative' for i in word_features}, tag) for tokens, tag in documents[numtrain:]]
classifier = nbc.train(train_set)
And:
>>> classifier.most_informative_features(10)
[('turturro', 'positive'),
('inhabiting', 'positive'),
('conflicted', 'positive'),
('taboo', 'positive'),
('overacts', 'positive'),
('rescued', 'positive'),
('stepdaughter', 'positive'),
('pup', 'positive'),
('apologizing', 'positive'),
('inform', 'positive')]
>>> type(classifier.most_informative_features(10)[0][1])
str
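To get from these (feature_name, feature_value) tuples to the nested-list format asked for in the question, a minimal sketch is to reshape the tuples with a list comprehension (a hypothetical one-liner, not part of the NLTK API). Note that the second element here is the feature value ('positive'/'negative' above), not the document class; for the class label itself, use the show_most_informative_features_in_list helper from the first answer:
# reshape (feature_name, feature_value) tuples into nested [name, value] lists
nested = [[fname, fval] for fname, fval in classifier.most_informative_features(10)]
print(nested)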
For Naive Bayes, the most informative features (the most distinguishing or differentiating tokens) will be the values with the largest difference in p(word | class) between the two classes.
You'll have to do some text manipulation and tokenization first, so that you end up with two lists: one list A of all tokens present in all strings tagged as the first class, and another list B of all tokens present in all strings tagged as the second class. These two lists should contain repeated tokens, which we can count to create frequency distributions; a sketch of this preprocessing step follows below.
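As a minimal sketch of that preprocessing step (the names labeled_strings, classAWords, and classBWords are assumptions for illustration, not from the original answer):
import nltk

# hypothetical input: (text, label) pairs for the two classes
labeled_strings = [("free investment offer on the internet", "classB"),
                   ("please pass along this important update", "classA")]

classAWords, classBWords = [], []
for text, label in labeled_strings:
    tokens = nltk.word_tokenize(text.lower())  # keep repeated tokens so they can be counted
    if label == "classA":
        classAWords.extend(tokens)
    else:
        classBWords.extend(tokens)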
Running this code:
classA_freq_distribution = nltk.FreqDist(classAWords)
classB_freq_distribution = nltk.FreqDist(classBWords)
# use most_common() so the features really are the 3000 most frequent tokens
classA_word_features = [word for word, count in classA_freq_distribution.most_common(3000)]
classB_word_features = [word for word, count in classB_freq_distribution.most_common(3000)]
This will grab the top 3000 features from each list, but you could pick a number other than 3000. Now that you have frequency distributions, you can compute p(word | class) and look at the differences between the two classes.
import pandas as pd  # needed for the DataFrame below

diff = []
features = []
for feature in classA_word_features:
    features.append(feature)
    # p(word | classB) - p(word | classA), estimated from the frequency distributions
    diff.append(classB_freq_distribution[feature] / len(classBWords)
                - classA_freq_distribution[feature] / len(classAWords))
all_features = pd.DataFrame({
    'Feature': features,
    'Diff': diff
})
You can then sort and look at the highest- and lowest-valued words:
sorted_features = all_features.sort_values(by=['Diff'], ascending=False)  # avoid shadowing the built-in sorted()
print(sorted_features)
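To tie this back to the [[word, class]] format from the question, a minimal sketch (assuming the sorted_features DataFrame above, where a positive Diff means the word is relatively more frequent in class B and a negative Diff means class A):
# label each word with the class in which it is relatively more frequent
nested = [[row.Feature, 'classB' if row.Diff > 0 else 'classA']
          for row in sorted_features.itertuples()]
print(nested[:5])   # words most indicative of class B
print(nested[-5:])  # words most indicative of class A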