Applying bag of words
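Hi, I'm working with bag of words and trying to implement it. Suppose I have the corpus below, but I don't want to use the vocabulary that print( vectorizer.fit_transform(corpus).todense() ) learns from it.
Instead, I created my own vocabulary, like
{u'all': 0, u'sunshine': 1, u'some': 2, u'down': 3, u'reason': 4}
How can I use this vocabulary to generate the matrix?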
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'All my cats in a row',
'When my cat sits down, she looks like a Furby toy!',
'The cat from outer space',
'Sunshine loves to sit like this for some reason.'
]
vectorizer = CountVectorizer()
print( vectorizer.fit_transform(corpus).todense() )
print( vectorizer.vocabulary_ )
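Instantiate your CountVectorizer with your custom vocabulary, then transform your corpus. Tokens that are not in the vocabulary are ignored, which is why the third document produces an all-zero row.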
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'All my cats in a row',
'When my cat sits down, she looks like a Furby toy!',
'The cat from outer space',
'Sunshine loves to sit like this for some reason.'
]
vocabulary = {u'all': 0, u'sunshine': 1, u'some': 2, u'down': 3, u'reason': 4}
vectorizer = CountVectorizer(vocabulary=vocabulary)
print( vectorizer.transform(corpus).todense() )
[[1 0 0 0 0]
[0 0 0 1 0]
[0 0 0 0 0]
[0 1 1 0 1]]
print( vectorizer.vocabulary_ )
{'all': 0, 'sunshine': 1, 'some': 2, 'down': 3, 'reason': 4}
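To see why the matrix comes out this way, here is a minimal pure-Python sketch of counting against a fixed vocabulary. It is not scikit-learn's implementation. It only assumes the documented CountVectorizer defaults (lowercasing, and tokens of two or more word characters), and the helper name count_with_vocabulary is made up for illustration.

```python
import re

# Assumption: approximates CountVectorizer's default tokenization
# (lowercase the text, keep tokens of two or more word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_with_vocabulary(corpus, vocabulary):
    """Hypothetical helper: count only the words present in `vocabulary`.

    `vocabulary` maps each word to its column index; every other token
    is skipped, so a document with no known words gives a row of zeros.
    """
    matrix = []
    for document in corpus:
        row = [0] * len(vocabulary)
        for token in TOKEN_PATTERN.findall(document.lower()):
            index = vocabulary.get(token)
            if index is not None:
                row[index] += 1
        matrix.append(row)
    return matrix


corpus = [
    'All my cats in a row',
    'When my cat sits down, she looks like a Furby toy!',
    'The cat from outer space',
    'Sunshine loves to sit like this for some reason.'
]
vocabulary = {'all': 0, 'sunshine': 1, 'some': 2, 'down': 3, 'reason': 4}

matrix = count_with_vocabulary(corpus, vocabulary)
for row in matrix:
    print(row)
```

Each column position comes straight from the index you assigned in the dictionary, so the column order is under your control rather than alphabetical.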