NLTK classifier for integer features?
I have integer-valued features in my feature vector, and NLTK's NaiveBayesClassifier treats them as nominal values.
Context
I am trying to build a language classifier using n-grams. For example, the letter bigram "th" is more common in English than in French.
For each sentence in my training set, I extract features such as bigram(th): 5, where 5 (just an example) is the number of times the bigram "th" occurs in the sentence.
When I build a classifier with these features and look at the most informative features, I realise that the classifier does not understand that such features are linear. For example, it may treat bigram(ea): 4 as French, bigram(ea): 5 as English, and bigram(ea): 6 as French again. That is quite arbitrary and does not capture the intuition that a bigram is simply more common in English or in French, which is why I need the integers to be treated as actual numbers.
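For reference, here is roughly how I train and inspect the classifier (a minimal sketch; the toy sentences and labels are placeholders, not my real data, and get_ngram_features is the extractor shown further down):

import nltk

# Placeholder labelled data: (sentence tokens, language) pairs.
labeled_sentences = [(['the', 'thing'], 'en'),
                     (['le', 'temps'], 'fr')]

# Build (featureset, label) pairs with the feature extractor below.
train_set = [(get_ngram_features(tokens), lang) for tokens, lang in labeled_sentences]

classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(20)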
More thoughts
Of course, I could replace these features with ones like has(th): True. However, I think that is a bad idea, because a French sentence with 1 occurrence of 'th' and an English sentence with 5 occurrences of 'th' would both end up with the feature has(th): True, which can no longer tell them apart.
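Just to make that alternative concrete, the conversion I have in mind would look something like this (to_boolean_features is a hypothetical helper name of mine):

def to_boolean_features(count_features):
    # Collapse every integer count into a presence flag,
    # e.g. {'bigram(th)': 5} -> {'bigram(th)': True}; the count is thrown away,
    # which is exactly why I think this loses too much information.
    return {name: True for name, count in count_features.items() if count > 0}

to_boolean_features({'bigram(th)': 5, 'bigram(ea)': 2})
# {'bigram(th)': True, 'bigram(ea)': True}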
I also found this relevant link, but it did not give me an answer.
Feature extractor
My feature extractor looks like this:
from nltk import ngrams

def get_ngrams(word, n):
    # Character n-grams of the word, padded with '_' on both sides.
    ngrams_list = []
    ngrams_list.append(list(ngrams(word, n, pad_left=True, pad_right=True, left_pad_symbol='_', right_pad_symbol='_')))
    ngrams_flat_tuples = [ngram for ngram_list in ngrams_list for ngram in ngram_list]
    # Join each n-gram tuple back into a plain string, e.g. ('t', 'h') -> 'th'.
    format_string = ''
    for i in range(0, n):
        format_string += '%s'
    ngrams_list_flat = [format_string % ngram_tuple for ngram_tuple in ngrams_flat_tuples]
    return ngrams_list_flat
# Feature extractor
def get_ngram_features(sentence_tokens):
    features = {}
    # Unigrams
    for word in sentence_tokens:
        ngrams = get_ngrams(word, 1)
        for ngram in ngrams:
            features[f'char({ngram})'] = features.get(f'char({ngram})', 0) + 1
    # Bigrams
    for word in sentence_tokens:
        ngrams = get_ngrams(word, 2)
        for ngram in ngrams:
            features[f'bigram({ngram})'] = features.get(f'bigram({ngram})', 0) + 1
    # Trigrams
    for word in sentence_tokens:
        ngrams = get_ngrams(word, 3)
        for ngram in ngrams:
            features[f'trigram({ngram})'] = features.get(f'trigram({ngram})', 0) + 1
    # Quadrigrams
    for word in sentence_tokens:
        ngrams = get_ngrams(word, 4)
        for ngram in ngrams:
            features[f'quadrigram({ngram})'] = features.get(f'quadrigram({ngram})', 0) + 1
    return features
Feature extraction example
get_ngram_features(['test', 'sentence'])
Returns:
{'char(c)': 1,
'char(e)': 4,
'char(n)': 2,
'char(s)': 2,
'char(t)': 3,
'bigram(_s)': 1,
'bigram(_t)': 1,
'bigram(ce)': 1,
'bigram(e_)': 1,
'bigram(en)': 2,
'bigram(es)': 1,
'bigram(nc)': 1,
'bigram(nt)': 1,
'bigram(se)': 1,
'bigram(st)': 1,
'bigram(t_)': 1,
'bigram(te)': 2,
'quadrigram(_sen)': 1,
'quadrigram(_tes)': 1,
'quadrigram(ence)': 1,
'quadrigram(ente)': 1,
'quadrigram(est_)': 1,
'quadrigram(nce_)': 1,
'quadrigram(nten)': 1,
'quadrigram(sent)': 1,
'quadrigram(tenc)': 1,
'quadrigram(test)': 1,
'trigram(_se)': 1,
'trigram(_te)': 1,
'trigram(ce_)': 1,
'trigram(enc)': 1,
'trigram(ent)': 1,
'trigram(est)': 1,
'trigram(nce)': 1,
'trigram(nte)': 1,
'trigram(sen)': 1,
'trigram(st_)': 1,
'trigram(ten)': 1,
'trigram(tes)': 1}
TL;DR
It is easier to use another library for this purpose. Using a custom analyzer with sklearn, doing something like https://www.kaggle.com/alvations/basic-nlp-with-nltk is easier, e.g. CountVectorizer(analyzer=preprocess_text).
For example:
from io import StringIO
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from nltk import everygrams

def sent_process(sent):
    return [''.join(ng) for ng in everygrams(sent.replace(' ', '_ _'), 1, 4)
            if ' ' not in ng and '\n' not in ng and ng != ('_',)]

sent1 = "The quick brown fox jumps over the lazy brown dog."
sent2 = "Mr brown jumps over the lazy fox."
sent3 = 'Mr brown quickly jumps over the lazy dog.'
sent4 = 'The brown quickly jumps over the lazy fox.'

with StringIO('\n'.join([sent1, sent2])) as fin:
    # Override the analyzer totally with our preprocess text
    count_vect = CountVectorizer(analyzer=sent_process)
    count_vect.fit_transform(fin)

count_vect.vocabulary_

train_set = count_vect.fit_transform([sent1, sent2])

# To train the classifier
clf = MultinomialNB()
clf.fit(train_set, ['pos', 'neg'])

test_set = count_vect.transform([sent3, sent4])
clf.predict(test_set)
Cut-away
Firstly, there is really no need to explicitly label the char(...), unigram(...), bigram(...), trigram(...) and quadrigram(...) parts of the features.
The feature sets are just dictionary keys, so you can use the actual ngram tuples as the keys, e.g.
from collections import Counter
from nltk import ngrams, word_tokenize
features = Counter(ngrams(word_tokenize('This is a something foo foo bar foo foo sentence'), 2))
[out]:
>>> features
Counter({('This', 'is'): 1,
('a', 'something'): 1,
('bar', 'foo'): 1,
('foo', 'bar'): 1,
('foo', 'foo'): 2,
('foo', 'sentence'): 1,
('is', 'a'): 1,
('something', 'foo'): 1})
For ngrams of several orders, you can use everygrams(), e.g.
from nltk import everygrams
sent = word_tokenize('This is a something foo foo bar foo foo sentence')
Counter(everygrams(sent, 1, 4))
[out]:
Counter({('This',): 1,
('This', 'is'): 1,
('This', 'is', 'a'): 1,
('This', 'is', 'a', 'something'): 1,
('a',): 1,
('a', 'something'): 1,
('a', 'something', 'foo'): 1,
('a', 'something', 'foo', 'foo'): 1,
('bar',): 1,
('bar', 'foo'): 1,
('bar', 'foo', 'foo'): 1,
('bar', 'foo', 'foo', 'sentence'): 1,
('foo',): 4,
('foo', 'bar'): 1,
('foo', 'bar', 'foo'): 1,
('foo', 'bar', 'foo', 'foo'): 1,
('foo', 'foo'): 2,
('foo', 'foo', 'bar'): 1,
('foo', 'foo', 'bar', 'foo'): 1,
('foo', 'foo', 'sentence'): 1,
('foo', 'sentence'): 1,
('is',): 1,
('is', 'a'): 1,
('is', 'a', 'something'): 1,
('is', 'a', 'something', 'foo'): 1,
('sentence',): 1,
('something',): 1,
('something', 'foo'): 1,
('something', 'foo', 'foo'): 1,
('something', 'foo', 'foo', 'bar'): 1})
A neat way to extract the desired features:
def sent_vectorizer(sent):
    return [''.join(ng) for ng in everygrams(sent.replace(' ', '_ _'), 1, 4)
            if ' ' not in ng and ng != ('_',)]
Counter(sent_vectorizer('This is a something foo foo bar foo foo sentence'))
[out]:
Counter({'o': 9, 's': 4, 'e': 4, 'f': 4, '_f': 4, 'fo': 4, 'oo': 4, 'o_': 4, '_fo': 4, 'foo': 4, 'oo_': 4, '_foo': 4, 'foo_': 4, 'i': 3, 'n': 3, 'h': 2, 'a': 2, 't': 2, 'hi': 2, 'is': 2, 's_': 2, '_s': 2, 'en': 2, 'is_': 2, 'T': 1, 'm': 1, 'g': 1, 'b': 1, 'r': 1, 'c': 1, 'Th': 1, '_i': 1, '_a': 1, 'a_': 1, 'so': 1, 'om': 1, 'me': 1, 'et': 1, 'th': 1, 'in': 1, 'ng': 1, 'g_': 1, '_b': 1, 'ba': 1, 'ar': 1, 'r_': 1, 'se': 1, 'nt': 1, 'te': 1, 'nc': 1, 'ce': 1, 'Thi': 1, 'his': 1, '_is': 1, '_a_': 1, '_so': 1, 'som': 1, 'ome': 1, 'met': 1, 'eth': 1, 'thi': 1, 'hin': 1, 'ing': 1, 'ng_': 1, '_ba': 1, 'bar': 1, 'ar_': 1, '_se': 1, 'sen': 1, 'ent': 1, 'nte': 1, 'ten': 1, 'enc': 1, 'nce': 1, 'This': 1, 'his_': 1, '_is_': 1, '_som': 1, 'some': 1, 'omet': 1, 'meth': 1, 'ethi': 1, 'thin': 1, 'hing': 1, 'ing_': 1, '_bar': 1, 'bar_': 1, '_sen': 1, 'sent': 1, 'ente': 1, 'nten': 1, 'tenc': 1, 'ence': 1})
In Long
Unfortunately, there is no easy way to change how the NaiveBayesClassifier in NLTK works, because that behaviour is hard-coded.
If we look at https://github.com/nltk/nltk/blob/develop/nltk/classify/naivebayes.py#L185, under the hood NLTK is already counting how often the features occur.
But note that it counts the document frequency, not the term frequency, i.e. in that case an element counts as one no matter how many times it appears in the document. There is no clean way to add up the value of each feature without changing the NLTK code, since it is hard-coded as +=1, see https://github.com/nltk/nltk/blob/develop/nltk/classify/naivebayes.py#L201
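That said, if you want to keep feeding NLTK-style dict featuresets (like the ones from get_ngram_features) while having the integer values used as counts, one option worth trying is NLTK's thin wrapper around scikit-learn estimators, nltk.classify.scikitlearn.SklearnClassifier. A minimal sketch, with toy featuresets as placeholders:

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy (featureset, label) pairs; in practice these would come from
# get_ngram_features() applied to real English/French sentences.
train = [({'bigram(th)': 5, 'bigram(ea)': 1}, 'en'),
         ({'bigram(th)': 0, 'bigram(ea)': 4}, 'fr')]

# The wrapper vectorizes the dict featuresets into a numeric matrix and hands
# it to the wrapped sklearn estimator, so the integer values are used as
# counts rather than as nominal categories.
clf = SklearnClassifier(MultinomialNB()).train(train)
clf.classify({'bigram(th)': 3, 'bigram(ea)': 1})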