CountVectorizer converts words to lower case
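In my classification model I need to preserve uppercase letters, but when I build the vocabulary with sklearn's CountVectorizer, the uppercase letters are converted to lowercase!
To rule out implicit tokenization, I wrote my own tokenizer that just splits the text on whitespace and does nothing else.
My code: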
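import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

co = dict()

def tokenizeManu(txt):
    # Split on whitespace only; no other processing.
    return txt.split()

def corpDict(x):
    print('1: ', x)
    count = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu)
    countFit = count.fit_transform(x)
    # On scikit-learn >= 1.0 use get_feature_names_out() instead.
    vocab = count.get_feature_names()
    dist = np.sum(countFit.toarray(), axis=0)
    # Loop variable renamed so it no longer shadows the vectorizer `count`.
    for tag, n in zip(vocab, dist):
        co[str(tag)] = n

x = ['I\'m John Dev', 'We are the only']
corpDict(x)
print(co)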
Output:
1: ["I'm John Dev", 'We are the only'] #<- before building the vocab.
{'john': 1, 'the': 1, 'we': 1, 'only': 1, 'dev': 1, "i'm": 1, 'are': 1} #<- after
You can set the lowercase parameter to False:
count = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu, lowercase=False)
Here are the parameters of CountVectorizer and their defaults:
CountVectorizer(analyzer=u'word', binary=False, charset=None,
charset_error=None, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=0,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=u'(?u)\b\w\w+\b',
tokenizer=None, vocabulary=None)
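A custom tokenizer alone does not help here because lowercasing happens in the preprocessing step, which runs before the tokenizer sees the text. The following is a minimal sketch of how one might check this, using the vectorizer's build_preprocessor() and build_analyzer() helpers. The name whitespace_tokenizer is a stand-in for the question's tokenizeManu, and passing token_pattern=None is an assumption made only to avoid the unused-pattern warning on recent scikit-learn releases.

```python
from sklearn.feature_extraction.text import CountVectorizer


def whitespace_tokenizer(text):
    # Hypothetical stand-in for tokenizeManu: split on whitespace only.
    return text.split()


# Same tokenizer, differing only in the lowercase flag.
default_vec = CountVectorizer(tokenizer=whitespace_tokenizer, token_pattern=None)
cased_vec = CountVectorizer(
    tokenizer=whitespace_tokenizer, token_pattern=None, lowercase=False
)

sample = "I'm John Dev"

# The preprocessor is applied to the raw string before tokenization.
lowered = default_vec.build_preprocessor()(sample)

# The analyzer chains preprocessing and tokenization.
default_tokens = list(default_vec.build_analyzer()(sample))
cased_tokens = list(cased_vec.build_analyzer()(sample))

print(lowered)
print(default_tokens)
print(cased_tokens)
```

With the default settings the tokenizer already receives lowercased text, so the case information is gone before the split happens.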
As stated in the documentation, CountVectorizer has a parameter lowercase, which defaults to True. To disable this behaviour, set lowercase=False as follows:
count = CountVectorizer(ngram_range=(1, 1), tokenizer=tokenizeManu, lowercase=False)
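For completeness, here is a self-contained sketch of the question's counting code with the fix applied. The function name count_tokens and the helper whitespace_tokenizer are hypothetical names introduced for illustration; token_pattern=None is an assumption to silence the unused-pattern warning, and the hasattr check is there only because get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn releases.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def whitespace_tokenizer(text):
    # Hypothetical stand-in for tokenizeManu: split on whitespace only.
    return text.split()


def count_tokens(corpus, lowercase=True):
    """Return a {token: total count} dict for the corpus."""
    vectorizer = CountVectorizer(
        ngram_range=(1, 1),
        tokenizer=whitespace_tokenizer,
        token_pattern=None,  # assumption: unused when a tokenizer is given
        lowercase=lowercase,
    )
    matrix = vectorizer.fit_transform(corpus)

    # Newer releases expose get_feature_names_out(); older ones only
    # have get_feature_names().
    if hasattr(vectorizer, "get_feature_names_out"):
        vocab = vectorizer.get_feature_names_out()
    else:
        vocab = vectorizer.get_feature_names()

    # Sum each column of the sparse matrix without densifying it.
    totals = np.asarray(matrix.sum(axis=0)).ravel()
    return {str(term): int(n) for term, n in zip(vocab, totals)}


corpus = ["I'm John Dev", "We are the only"]
print(count_tokens(corpus, lowercase=False))
```

Returning a new dict rather than filling a module-level one keeps the function reusable; summing the sparse matrix directly also avoids the toarray() call, which can be costly on a large vocabulary.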