CountVectorizer() not working with single-letter words

Question

Suppose I have to apply CountVectorizer() to the following data:

words = [
     'A am is',
     'This the a',
     'the am is',
     'this a am',
]

I did the following:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(words)
print(X.toarray())

It returns the following:

[[1 1 0 0]
 [0 0 1 1]
 [1 1 1 0]
 [1 0 0 1]]

For reference, print(vectorizer.get_feature_names()) prints ['am', 'is', 'the', 'this'].

Why isn't 'a' picked up? Are single-letter words not counted as words by CountVectorizer()?

Answer 1

Check the docs:

token_pattern

Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

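The default token pattern, (?u)\b\w\w+\b, only matches tokens of two or more characters, so every single-character token is dropped. That is why 'a' is missing.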
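To see the effect outside of scikit-learn, the pattern can be run directly with Python's re module. This is only a minimal sketch, assuming the default token_pattern is (?u)\b\w\w+\b; RELAXED_PATTERN is a name made up here for illustration, and the input is already lowercase because CountVectorizer lowercases text before tokenizing by default.

```python
import re

# Default CountVectorizer token pattern: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Illustrative relaxed pattern (not a scikit-learn name): one or more word characters.
RELAXED_PATTERN = r"(?u)\b\w+\b"

text = "this a am"

default_tokens = re.findall(DEFAULT_PATTERN, text)
relaxed_tokens = re.findall(RELAXED_PATTERN, text)

print(default_tokens)  # ['this', 'am']
print(relaxed_tokens)  # ['this', 'a', 'am']
```

The only difference between the two patterns is the extra \w, which is what forces a minimum token length of two.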

If you want single-character tokens to appear in the vocabulary, you need to override this behavior, for example with a custom tokenizer.

Sample code

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(tokenizer=lambda txt: txt.split())
X = vectorizer.fit_transform(words)
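# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead.
print(list(vectorizer.get_feature_names_out()))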

Output:

['a', 'am', 'is', 'the', 'this']
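Since the documentation quoted above names token_pattern as the setting that defines a token, another option is to relax that pattern rather than replace the tokenizer. The following is a sketch under that assumption, not the answer's original method. It reads the vocabulary from the fitted vocabulary_ attribute so that it does not depend on which feature-name method your scikit-learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

words = [
    'A am is',
    'This the a',
    'the am is',
    'this a am',
]

# Relax the default pattern so that one-character tokens are kept.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(words)

# vocabulary_ maps each term to its column index; sorting the keys
# lists the terms in column order.
feature_names = sorted(vectorizer.vocabulary_)
print(feature_names)
print(X.toarray())
```

Unlike txt.split(), which splits on whitespace only and leaves punctuation attached to the tokens, this pattern still treats punctuation as a separator, so it may be the safer choice for text that is not already cleaned.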
