TfidfVectorizer: ValueError: not a built-in stop list: russian
I am trying to apply TfidfVectorizer with Russian stop words:
Tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words='russian' )
Z = Tfidf.fit_transform(X)
and I get:
ValueError: not a built-in stop list: russian
whereas with English stop words it works correctly:
Tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words='english' )
Z = Tfidf.fit_transform(X)
How can I fix this?
Full traceback:
<ipython-input-118-e787bf15d612> in <module>()
1 Tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words='russian' )
----> 2 Z = Tfidf.fit_transform(X)
C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1303 Tf-idf-weighted document-term matrix.
1304 """
-> 1305 X = super(TfidfVectorizer, self).fit_transform(raw_documents)
1306 self._tfidf.fit(X)
1307 # X is already a transformed view of raw_documents so
C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
815
816 vocabulary, X = self._count_vocab(raw_documents,
--> 817 self.fixed_vocabulary_)
818
819 if self.binary:
C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
745 vocabulary.default_factory = vocabulary.__len__
746
--> 747 analyze = self.build_analyzer()
748 j_indices = _make_int_array()
749 indptr = _make_int_array()
C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in build_analyzer(self)
232
233 elif self.analyzer == 'word':
--> 234 stop_words = self.get_stop_words()
235 tokenize = self.build_tokenizer()
236
C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in get_stop_words(self)
215 def get_stop_words(self):
216 """Build or fetch the effective stop words list"""
--> 217 return _check_stop_list(self.stop_words)
218
219 def build_analyzer(self):
C:\Program Files\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _check_stop_list(stop)
88 return ENGLISH_STOP_WORDS
89 elif isinstance(stop, six.string_types):
---> 90 raise ValueError("not a built-in stop list: %s" % stop)
91 elif stop is None:
92 return None
ValueError: not a built-in stop list: russian
Could you read the documentation before posting? It says:
stop_words : string {'english'}, list, or None (default)
    If a string, it is passed to _check_stop_list and the appropriate stop list is returned. 'english' is currently the only supported string value.
    If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
    If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
There are several stop-word libraries that support more languages, for example stop-words or SnowballStemmer.
If you want to extend the built-in English list with your own entries, build the set and pass the variable itself to stop_words, not the string 'my_stop_words':
from sklearn.feature_extraction import text
my_stop_words = text.ENGLISH_STOP_WORDS.union(["russian"])
Tfidf = sklearn.feature_extraction.text.TfidfVectorizer(stop_words=list(my_stop_words))
Z = Tfidf.fit_transform(X)
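Since 'english' is the only supported string value, the most direct fix for Russian is to pass the stop words as a list. A minimal sketch, using a tiny hand-picked list and toy documents for illustration (a real project would load a full list, e.g. from the stop-words package or NLTK):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny hand-picked Russian stop list, just for illustration.
# In practice, load a complete list from the stop-words package or NLTK.
russian_stop_words = ["и", "в", "не", "на", "что", "это"]

# Toy corpus (hypothetical, stands in for X in the question).
docs = [
    "это кошка и собака",
    "собака на улице",
]

# Passing a list works for any language; only the string 'english' is built in.
tfidf = TfidfVectorizer(stop_words=russian_stop_words)
Z = tfidf.fit_transform(docs)

# The stop words are removed from the learned vocabulary.
print(sorted(tfidf.vocabulary_))  # ['кошка', 'собака', 'улице']
```

The same pattern accepts a set or frozenset converted with list(), so you can union several sources (English built-ins plus a downloaded Russian list) before handing them to the vectorizer.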