CountVectorizer() not working with single letter word
Consider that I have to apply CountVectorizer() to the following data:
words = [
'A am is',
'This the a',
'the am is',
'this a am',
]
I did the following:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(words)
print(X.toarray())
It returns the following:
[[1 1 0 0]
[0 0 1 1]
[1 1 1 0]
[1 0 0 1]]
For reference, print(vectorizer.get_feature_names())
prints ['am', 'is', 'the', 'this']
Why is 'a' not being read? Does CountVectorizer() not count single-letter words as words?
Check the docs:
token_pattern
Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
The default tokenizer ignores all single-character tokens. That is why 'a' is missing.
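To see this concretely, you can test the pattern outside of CountVectorizer. A quick sketch, assuming the default token_pattern of r"(?u)\b\w\w+\b" documented by scikit-learn:

import re

# scikit-learn's documented default token_pattern:
# only tokens of 2 or more word characters match.
default_pattern = r"(?u)\b\w\w+\b"

print(re.findall(default_pattern, "a am is"))
# ['am', 'is'] -- the single-character 'a' never matches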
If you want single-character tokens to appear in the vocabulary, then you have to use a custom tokenizer. Example code:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(tokenizer=lambda txt: txt.split())
X = vectorizer.fit_transform(words)
print(vectorizer.get_feature_names())
Output:
['a', 'am', 'is', 'the', 'this']
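Alternatively, instead of swapping in a whole tokenizer, you can relax token_pattern so that one-character tokens also match. A minimal sketch, assuming the common r"(?u)\b\w+\b" pattern (one or more word characters); unlike str.split(), this keeps punctuation acting as a token separator:

from sklearn.feature_extraction.text import CountVectorizer

words = [
    'A am is',
    'This the a',
    'the am is',
    'this a am',
]

# \w+ instead of \w\w+ keeps single-character tokens.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(words)
print(vectorizer.get_feature_names())  # get_feature_names_out() in scikit-learn >= 1.0
# ['a', 'am', 'is', 'the', 'this']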