Tf-idf with char_wb ignores custom preprocessor?
I have:
import nltk
from nltk.stem.snowball import GermanStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

def my_tokenizer(doc):
    stemmer = GermanStemmer()
    # my_stop_words is defined elsewhere in the script (not shown here)
    return [stemmer.stem(t.lower()) for t in nltk.word_tokenize(doc)
            if t.lower() not in my_stop_words]

text = "hallo df sdfd"
singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6),
                              preprocessor=my_tokenizer, max_features=50).fit([str(text)])
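From the docs it is clear that a custom tokenizer only applies when analyzer='word', which is why I passed my function as the preprocessor instead.
I get: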
Traceback (most recent call last):
File "TfidF.py", line 95, in <module>
singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4,6),preprocessor=my_tokenizer, max_features=50).fit([str(text)])
File "C:\Users\chris1\Anaconda3\envs\master\lib\site-packages\sklearn\feature_extraction\text.py", line 185, in _char_wb_ngrams
text_document = self._white_spaces.sub(" ", text_document)
TypeError: expected string or bytes-like object
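You have to join the words and return a single string. The traceback shows char_wb running a regex substitution on whatever the preprocessor returns, so a list fails with the TypeError above.
Try this: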
    return ' '.join([stemmer.stem(t.lower()) for t in nltk.word_tokenize(doc)
                     if t.lower() not in my_stop_words])
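For reference, here is a minimal sketch of the whole thing with the fix applied. It makes several assumptions so that it runs without NLTK data: `MY_STOP_WORDS` is a made-up list (the question never shows `my_stop_words`), a simple `\w+` regex stands in for `nltk.word_tokenize`, and the `stem` argument is a hypothetical hook that defaults to doing nothing; pass `GermanStemmer().stem` to get the behaviour from the question. `fit_char_wb` is just an illustrative helper name.

```python
import re

# Hypothetical stop-word list; the question does not show my_stop_words.
MY_STOP_WORDS = {"und", "der", "die", "das"}


def my_preprocessor(doc, stem=lambda token: token):
    """Lowercase, drop stop words, stem, and return ONE string.

    `stem` is a stand-in for GermanStemmer().stem so the sketch runs
    without NLTK; pass the real stemmer to mirror the question.
    """
    # Simple stand-in for nltk.word_tokenize: runs of word characters.
    tokens = re.findall(r"\w+", doc.lower())
    kept = [stem(t) for t in tokens if t not in MY_STOP_WORDS]
    # char_wb applies a regex to the preprocessor output, so it must be a str.
    return " ".join(kept)


def fit_char_wb(docs):
    # Imported lazily so the preprocessor can be tried without scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(
        analyzer="char_wb",
        ngram_range=(4, 6),
        preprocessor=my_preprocessor,
        max_features=50,
    )
    return vectorizer.fit(docs)
```

Calling `fit_char_wb(["hallo df sdfd"])` should then fit without the TypeError, because the analyzer receives a plain string. The character n-grams are built from the joined, stemmed text, so stop-word removal and stemming still take effect.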