Pickle a TfidfVectorizer along with a custom tokenizer
I'm passing a custom tokenizer to TfidfVectorizer. The tokenizer depends on an external class, TermExtractor, which lives in another file.
Basically, I want to build a TfidfVectorizer based on certain terms rather than on every single word/token.
Here is the code:
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from TermExtractor import TermExtractor

extractor = TermExtractor()

def tokenize_terms(text):
    # Join each multi-word term returned by TermExtractor into a single token,
    # e.g. ('customer', 'service') -> 'customer_service'
    terms = extractor.extract(text)
    tokens = []
    for t in terms:
        tokens.append('_'.join(t))
    return tokens

def main():
    # stop_words and corpus are defined elsewhere in the script
    vectorizer = TfidfVectorizer(lowercase=True, min_df=2, norm='l2', smooth_idf=True,
                                 stop_words=stop_words, tokenizer=tokenize_terms)
    vectorizer.fit(corpus)
    pickle.dump(vectorizer, open("models/terms_vectorizer", "wb"))
This works fine, but whenever I try to reuse this TfidfVectorizer and load it with pickle, I get an error:
vectorizer = pickle.load(open("models/terms_vectorizer", "rb"))
Traceback (most recent call last):
File "./train-nps-comments-classifier.py", line 427, in <module>
main()
File "./train-nps-comments-classifier.py", line 325, in main
vectorizer = pickle.load(open("models/terms_vectorizer", "rb"))
File "/usr/lib/python2.7/pickle.py", line 1378, in load
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1090, in load_global
klass = self.find_class(module, name)
File "/usr/lib/python2.7/pickle.py", line 1126, in find_class
klass = getattr(mod, name)
AttributeError: 'module' object has no attribute 'tokenize_terms'
How does Python pickle work when the pickled object depends on a class like this?
Figured it out: in the script that loads the pickled TfidfVectorizer, I also need to define the tokenize_terms() function, import TermExtractor, and create an extractor:
extractor = TermExtractor()
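In other words, pickle does not serialize the tokenizer function itself; it only stores a reference to it by module and name, so the same names must be importable (or defined) in the loading script before pickle.load is called. A minimal sketch of what the loading side might look like, following the fix above (the new-document names in the comments are hypothetical):

import pickle

from TermExtractor import TermExtractor

# The unpickler looks up tokenize_terms by name in this module,
# so the function (and the extractor it relies on) must exist here too.
extractor = TermExtractor()

def tokenize_terms(text):
    terms = extractor.extract(text)
    return ['_'.join(t) for t in terms]

vectorizer = pickle.load(open("models/terms_vectorizer", "rb"))

# The loaded vectorizer can now transform new documents with the custom tokenizer, e.g.:
# X = vectorizer.transform(new_docs)  # new_docs: a list of strings (hypothetical name)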