Combining CountVectorizer and ngrams in Python
There is a task to classify male and female names using ngrams.
So, there is a dataframe like this:
name is_male
Dorian 1
Jerzy 1
Deane 1
Doti 0
Betteann 0
Donella 0
The specific requirement is to use
from nltk.util import ngrams
to create ngrams (n=2,3,4) for this task.
I made a list of names and then used ngrams:
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

test_ngrams = []
for name in name_list:
    test_ngrams.append(list(ngrams(name, 3)))
Now I need to vectorize all of this somehow so it can be used for classification. I tried
X_train = count_vect.fit_transform(test_ngrams)
and got:
AttributeError: 'list' object has no attribute 'lower'
I understand that a list is the wrong input type here. Could someone explain how I should proceed, so that I can later use MultinomialNB, for example? Am I doing this right at all?
Thanks in advance!
You are passing a list of lists to the vectorizer, which is why you get the AttributeError. Instead, you should pass an iterable of strings. From the CountVectorizer documentation:
fit_transform(raw_documents, y=None)
Learn the vocabulary dictionary and return term-document matrix.
This is equivalent to fit followed by transform, but more efficiently implemented.
Parameters: raw_documents : iterable
An iterable which yields either str, unicode or file objects.
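For example, just to illustrate the expected input type, here is a minimal sketch reusing name_list from your snippet (note this does not yet produce the required ngrams, only valid input):

count_vect = CountVectorizer()
# name_list holds plain strings, which is exactly what fit_transform expects
X_train = count_vect.fit_transform(name_list)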
To answer your question: CountVectorizer can create N-grams itself using the ngram_range parameter (the following produces bigrams):
count_vect = CountVectorizer(ngram_range=(2,2))
corpus = [
    'This is the first document.',
    'This is the second second document.',
]
X = count_vect.fit_transform(corpus)
print(count_vect.get_feature_names())
['first document', 'is the', 'second document', 'second second', 'the first', 'the second', 'this is']
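A side note: since your documents are single names, character n-grams are probably what you actually want, and CountVectorizer can build those itself via analyzer='char'. A minimal sketch for the n=2,3,4 requirement, using names from your dataframe (also note that on recent scikit-learn versions you must call get_feature_names_out() instead, since get_feature_names() was removed in 1.2):

count_vect = CountVectorizer(analyzer='char', ngram_range=(2, 4))
X = count_vect.fit_transform(['Dorian', 'Jerzy', 'Deane'])
# features are lowercased character ngrams of length 2-4, e.g. 'do', 'ori', 'rian'
print(count_vect.get_feature_names())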
Update:
Since you mentioned that you have to generate the ngrams using NLTK, we need to override parts of CountVectorizer's default behavior; namely, the analyzer, which turns raw strings into features:
analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable
[...]
If a callable is passed it is used to extract the sequence of features
out of the raw, unprocessed input.
Since we already provide the ngrams ourselves, an identity function suffices:
count_vect = CountVectorizer(
    analyzer=lambda x: x
)
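With this callable in place, the vectorizer skips its own tokenization and preprocessing entirely and simply counts whatever items each document yields, so every distinct ngram tuple becomes one vocabulary entry. A quick sanity check with hypothetical input:

docs = [[('a', 'b')], [('a', 'b'), ('b', 'c')]]
X = count_vect.fit_transform(docs)
# vocabulary: {('a', 'b'): 0, ('b', 'c'): 1}; X counts each tuple per document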
A complete example combining NLTK ngrams with CountVectorizer:
import nltk

corpus = [
    'This is the first document.',
    'This is the second second document.',
]

def build_ngrams(text, n=2):
    tokens = text.lower().split()
    return list(nltk.ngrams(tokens, n))

# replace each document with its list of bigram tuples
corpus = [build_ngrams(document) for document in corpus]

count_vect = CountVectorizer(
    analyzer=lambda x: x
)
X = count_vect.fit_transform(corpus)
print(count_vect.get_feature_names())
[('first', 'document.'), ('is', 'the'), ('second', 'document.'), ('second', 'second'), ('the', 'first'), ('the', 'second'), ('this', 'is')]
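Tying this back to the name task: below is a sketch of my own (not part of the original answer) that builds character ngrams for n=2,3,4 with nltk.util.ngrams via a custom analyzer and feeds the result to MultinomialNB; name_list and the labels are assumed to come from the dataframe in the question.

from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def name_ngrams(name, ns=(2, 3, 4)):
    # character ngrams of each requested length, as tuples
    name = name.lower()
    grams = []
    for n in ns:
        grams.extend(ngrams(name, n))
    return grams

# assumed to come from the dataframe in the question
name_list = ['Dorian', 'Jerzy', 'Deane', 'Doti', 'Betteann', 'Donella']
labels = [1, 1, 1, 0, 0, 0]

count_vect = CountVectorizer(analyzer=name_ngrams)
X_train = count_vect.fit_transform(name_list)

clf = MultinomialNB().fit(X_train, labels)
print(clf.predict(count_vect.transform(['Doreen'])))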