Why does CountVectorizer throw an "Empty Vocabulary error" for a bigram when there are two words?
I have a CountVectorizer:
word_vectorizer = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
Running the vectorizer:
X = word_vectorizer.fit_transform(group['cleanComments'])
raises this error:
Traceback (most recent call last):
File "<ipython-input-63-d261e44b8cce>", line 1, in <module>
runfile('C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py', wdir='C:/Users/taca/Documents/Work/Python/Text Analytics')
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py", line 38, in <module>
X = word_vectorizer.fit_transform(group['cleanComments'])
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 781, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
This error occurs when the document the n-grams are pulled from is the string "duplicate q". It happens any time the document is of the form "<any word> <single letter>".
Why doesn't CountVectorizer count q (or any single letter, for that matter) as a valid word? Is there a comprehensive list anywhere of the reasons CountVectorizer can throw this error?
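For reference, the error can be reproduced in isolation with just that string, so it isn't anything specific to my DataFrame (a minimal sketch using the same settings as above):

from sklearn.feature_extraction.text import CountVectorizer

# Same vectorizer settings as above, applied to the offending string alone.
word_vectorizer = CountVectorizer(stop_words=None, ngram_range=(2, 2), analyzer='word')
word_vectorizer.fit_transform(["duplicate q"])  # ValueError: empty vocabulary ...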
Edit: I did some more digging into the error itself, and it looks like it has to do with the vocabulary. I'm assuming the standard vocabulary doesn't accept single letters as words, but I'm not sure how to get around that.
The error is raised by _count_vocab(), a method of the CountVectorizer class. The class comes with a token_pattern that defines what counts as a word. The documentation for the token_pattern parameter notes:
The default regexp select tokens of 2 or more alphanumeric characters
We can see this explicitly in the default arguments of __init__:
token_pattern=r"(?u)\b\w\w+\b"
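To see what that pattern does to the offending string, here is a quick check with Python's re module (a standalone sketch, not scikit-learn internals):

import re

# The default pattern requires two or more \w characters per token,
# so the lone "q" never becomes a token at all.
print(re.findall(r"(?u)\b\w\w+\b", "duplicate q"))  # ['duplicate']
print(re.findall(r"(?u)\b\w+\b", "duplicate q"))    # ['duplicate', 'q']

With only one token left, no bigram can be formed, so the vocabulary ends up empty and _count_vocab() raises the error.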
If you want to allow single-letter words, just drop the first \w from this pattern and set token_pattern explicitly when instantiating your CountVectorizer:
CountVectorizer(token_pattern=r"(?u)\b\w+\b",
stop_words=None, ngram_range=(2,2), analyzer='word')
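With that change the bigram comes through as expected (a quick sketch; vocabulary_ is populated on any fitted vectorizer):

from sklearn.feature_extraction.text import CountVectorizer

word_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b",
                                  stop_words=None, ngram_range=(2, 2), analyzer='word')
X = word_vectorizer.fit_transform(["duplicate q"])
print(word_vectorizer.vocabulary_)  # {'duplicate q': 0}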