CountVectorizer raising error on short words
Can anyone explain why CountVectorizer raises this error when I try to fit_transform any short word? I still get the same error even when I use stop_words=None.
Here is the code:
from sklearn.feature_extraction.text import CountVectorizer
text = ['don\'t know when I shall return to the continuation of my scientific work. At the moment I can do absolutely nothing with it, and limit myself to the most necessary duty of my lectures; how much happier I would be to be scientifically active, if only I had the necessary mental freshness.']
cv = CountVectorizer(stop_words=None).fit(text)
and it works as expected. Then, if I try to fit_transform with another text
cv.fit_transform(['q'])
the error raised is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-acbd560df1a2> in <module>()
----> 1 cv.fit_transform(['q'])
~/.local/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
867
868 vocabulary, X = self._count_vocab(raw_documents,
--> 869 self.fixed_vocabulary_)
870
871 if self.binary:
~/.local/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
809 vocabulary = dict(vocabulary)
810 if not vocabulary:
--> 811 raise ValueError("empty vocabulary; perhaps the documents only"
812 " contain stop words")
813
ValueError: empty vocabulary; perhaps the documents only contain stop words
I have read a few threads about this error, since it seems to be one that CountVectorizer raises frequently, but everything I found only covers the case where the text really does contain nothing but stop words. I honestly don't know what my problem is, so any help would be greatly appreciated!
CountVectorizer(token_pattern=r'(?u)\b\w\w+\b')
by default tokenizes only words (tokens) of two or more characters.
You can change this default behavior:
vect = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
Test:
In [29]: vect.fit_transform(['q'])
Out[29]:
<1x1 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>
In [30]: vect.get_feature_names()
Out[30]: ['q']
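For completeness, here is a minimal, self-contained sketch of the fix (assuming a recent scikit-learn, where get_feature_names() has been replaced by get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

# The default pattern r'(?u)\b\w\w+\b' drops single-character tokens,
# so fitting on ['q'] alone leaves an empty vocabulary and raises ValueError.
# Allowing single-character tokens avoids the error:
vect = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
X = vect.fit_transform(['q'])
print(X.shape)                        # (1, 1)
print(vect.get_feature_names_out())   # ['q']  (use get_feature_names() on older scikit-learn)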