How to use sklearn's CountVectorizer to get n-grams that include punctuation as separate tokens?
I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example:
import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
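# Note: get_feature_names() was removed in scikit-learn 1.2; on 1.0 or later use get_feature_names_out()
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))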
Output:
4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']
The punctuation is removed: how can I include it as separate tokens?
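By default, CountVectorizer tokenizes with the pattern (?u)\b\w\w+\b, which keeps only runs of two or more word characters, so punctuation (and one-letter words such as "I") is dropped. When creating the sklearn.feature_extraction.text.CountVectorizer instance, you should use the tokenizer parameter to specify a word tokenizer that treats punctuation as separate tokens.
For example, nltk.tokenize.TreebankWordTokenizer treats most punctuation characters as separate tokens: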
import sklearn.feature_extraction.text
from nltk.tokenize import TreebankWordTokenizer
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
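vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size),
                                                       tokenizer=TreebankWordTokenizer().tokenize)
vect.fit(string)  # the vocabulary must be fitted before feature names are available
# Note: get_feature_names() was removed in scikit-learn 1.2; on 1.0 or later use get_feature_names_out()
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))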
Output:
4-grams: [u"'s pretty awesome .", u", it 's pretty", u'i really like python',
u"it 's pretty awesome", u'like python , it', u"python , it 's",
u'really like python ,']
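If you would rather not depend on NLTK, a small regex-based tokenizer may be enough. The following is only a sketch: punct_tokenize and ngrams are hypothetical helper names, not part of scikit-learn or of the answer above, and the regex is one possible choice. It uses only the standard library so you can see which tokens and 4-grams this approach produces.

```python
import re

# Hypothetical helper (an assumption, not from the original answer):
# a run of word characters is one token, and every single
# non-word, non-space character (i.e. punctuation) is its own token.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def punct_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return _TOKEN_RE.findall(text)

def ngrams(tokens, n):
    """Return all contiguous n-grams of tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Lowercase first, mirroring CountVectorizer's default lowercase=True.
tokens = punct_tokenize("I really like python, it's pretty awesome.".lower())
four_grams = ngrams(tokens, 4)

print(tokens)
print(four_grams)
```

Passing tokenizer=punct_tokenize to CountVectorizer should give the same n-grams as feature names, in sorted order. Note that this regex splits "it's" into it, ' and s, whereas TreebankWordTokenizer keeps 's together, so pick whichever behaviour suits your data.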