Creating a feature dictionary for Python machine learning (naive bayes) algorithm

For example, I want to use surnames to predict whether a person is Chinese or non-Chinese. In particular, I want to extract three-letter substrings from the surname. So, for example, the surname "gao" would give one feature "gao", while "chan" would give two features "cha" and "han".

The splitting works in the three_split function below. But as far as I understand, to combine this into a feature set I need to return the output as a dictionary. Any hints on how to do that? For "Chan", the dictionary should return "cha" and "han" as TRUE.
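
In other words, the desired feature dictionary would look something like this (my own sketch of the expected output):

>>> three_split("chan")
{'cha': True, 'han': True}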

from nltk.classify import PositiveNaiveBayesClassifier
import re

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']

nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']

def three_split(word):
    word = word.lower()
    word = word.replace(" ", "_")
    split = 3
    return [word[start:start+split] for start in range(0, len(word)-2)]

positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, 
    unlabeled_featuresets)

print three_split("Jim Silva")
print classifier.classify(three_split("Jim Silva"))

After some trial and error, I think I've got it. Thanks.

from nltk.classify import PositiveNaiveBayesClassifier
import re

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']

nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']

def three_split(word):
    word = word.lower()
    word = word.replace(" ", "_")
    split = 3
    return dict(("contains(%s)" % word[start:start+split], True) 
        for start in range(0, len(word)-2))

positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, 
    unlabeled_featuresets)

name = "dennis kidd"
print three_split(name)
print classifier.classify(three_split(name))

Here is a white-box answer:

Running your original code, it outputs:

Traceback (most recent call last):
  File "test.py", line 17, in <module>
    unlabeled_featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/positivenaivebayes.py", line 108, in train
    for fname, fval in featureset.items():
AttributeError: 'list' object has no attribute 'items'

Looking at line 17:

classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, 
    unlabeled_featuresets)

It seems that PositiveNaiveBayesClassifier requires an object that has an '.items()' attribute, and intuitively, if the NLTK code is Pythonic, it should be a dict.
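
In other words, each training example has to be a mapping from feature name to feature value. A quick sketch of the difference (in a Python 2 shell, to match the traceback above):

>>> {'contains(gao)': True}.items()
[('contains(gao)', True)]
>>> ['gao'].items()
Traceback (most recent call last):
  ...
AttributeError: 'list' object has no attribute 'items'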

Looking at https://github.com/nltk/nltk/blob/develop/nltk/classify/positivenaivebayes.py#L88, there is no clear explanation of what the positive_featuresets parameter should contain:

:param positive_featuresets: A list of featuresets that are known as positive examples (i.e., their label is True).

Checking the docstring, we see this example:

Example:
    >>> from nltk.classify import PositiveNaiveBayesClassifier
Some sentences about sports:
    >>> sports_sentences = [ 'The team dominated the game',
    ...                      'They lost the ball',
    ...                      'The game was intense',
    ...                      'The goalkeeper catched the ball',
    ...                      'The other team controlled the ball' ]
Mixed topics, including sports:
    >>> various_sentences = [ 'The President did not comment',
    ...                       'I lost the keys',
    ...                       'The team won the game',
    ...                       'Sara has two kids',
    ...                       'The ball went off the court',
    ...                       'They had the ball for the whole game',
    ...                       'The show is over' ]
The features of a sentence are simply the words it contains:
    >>> def features(sentence):
    ...     words = sentence.lower().split()
    ...     return dict(('contains(%s)' % w, True) for w in words)
We use the sports sentences as positive examples, the mixed ones ad unlabeled examples:
    >>> positive_featuresets = list(map(features, sports_sentences))
    >>> unlabeled_featuresets = list(map(features, various_sentences))
    >>> classifier = PositiveNaiveBayesClassifier.train(positive_featuresets,
    ...                                                 unlabeled_featuresets)

Now we can see the features() function that converts a sentence into features, and what it returns:

dict(('contains(%s)' % w, True) for w in words)

Basically, that is the thing that can answer the .items() call. Looking at the dict comprehension, it seems the 'contains(%s)' % w part is somewhat redundant unless it is there for human readability, so you could just use dict((w, True) for w in words).
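
To illustrate with one of the docstring sentences (a small sketch; only the sorted keys are shown, since dict ordering varies):

>>> words = 'The team dominated the game'.lower().split()
>>> sorted(dict(('contains(%s)' % w, True) for w in words))
['contains(dominated)', 'contains(game)', 'contains(team)', 'contains(the)']
>>> sorted(dict((w, True) for w in words))
['dominated', 'game', 'team', 'the']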

Also, replacing the space with an underscore may be redundant unless it is needed later.
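
For instance (a quick sketch), without the underscore replacement the space simply ends up inside a few of the trigrams, which the classifier handles just fine:

>>> word = 'jim silva'
>>> [word[start:start+3] for start in range(0, len(word)-2)]
['jim', 'im ', 'm s', ' si', 'sil', 'ilv', 'lva']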

Finally, the slicing and bounded iteration can be replaced with an ngrams function that extracts character ngrams, e.g.:

>>> word = 'alexgao'
>>> split=3
>>> [word[start:start+split] for start in range(0, len(word)-2)]
['ale', 'lex', 'exg', 'xga', 'gao']
# With ngrams
>>> from nltk.util import ngrams
>>> ["".join(ng) for ng in ngrams(word,3)]
['ale', 'lex', 'exg', 'xga', 'gao']

So your feature extraction function can be simplified as such:

from nltk.util import ngrams

def three_split(word):
    # Join each character trigram back into a string key mapped to True.
    return dict(("".join(ng), True) for ng in ngrams(word.lower(), 3))

[out]:

{'im ': True, 'm s': True, 'jim': True, 'ilv': True, ' si': True, 'lva': True, 'sil': True}
False

In fact, NLTK classifiers are versatile enough that you can use tuples of characters as features directly, so there is no need to stitch the ngrams back together when extracting the features, i.e.:

from nltk.classify import PositiveNaiveBayesClassifier
import re
from nltk.util import ngrams

chinese_names = ['gao', 'chan', 'chen', 'Tsai', 'liu', 'Lee']

nonchinese_names = ['silva', 'anderson', 'kidd', 'bryant', 'Jones', 'harris', 'davis']


def three_split(word):
    # Keep the character trigrams as tuples; NLTK accepts tuple feature keys.
    return dict((ng, True) for ng in ngrams(word.lower(), 3))

positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))

classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, 
    unlabeled_featuresets)

print three_split("Jim Silva")
print classifier.classify(three_split("Jim Silva"))

[out]:

{('m', ' ', 's'): True, ('j', 'i', 'm'): True, ('s', 'i', 'l'): True, ('i', 'l', 'v'): True, (' ', 's', 'i'): True, ('l', 'v', 'a'): True, ('i', 'm', ' '): True}