NotFittedError: CountVectorizer - Vocabulary wasn't fitted. while performing sentiment analysis

I am performing sentiment analysis using this dataset:

http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

The dataset contains 25K training and 25K test reviews (12.5K positive and 12.5K negative in each split). I keep getting:

NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

Code:

(The required libraries and variables are initialized separately.)

Creating the training and test data:

import glob
import os
import numpy as np
def load_texts_labels_from_folders(path, folders):
    texts,labels = [],[]
    for idx,label in enumerate(folders):
        for fname in glob.glob(os.path.join(path, label, '*.*')):
            texts.append(open(fname, 'r',encoding="utf8").read())
            labels.append(idx)
    # stored as np.int8 to save space 
    return texts, np.array(labels).astype(np.int8)

trn,trn_y = load_texts_labels_from_folders(f'{PATH}train',names)
val,val_y = load_texts_labels_from_folders(f'{PATH}test',names)

len(trn),len(trn_y),len(val),len(val_y)

len(trn_y[trn_y==1]),len(val_y[val_y==1])

np.unique(trn_y)

Count vectorization:

re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()

# create the term-document matrix
veczr = CountVectorizer(tokenizer=tokenize)


trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

veczr = CountVectorizer(tokenizer=tokenize,ngram_range=(1,3), min_df=1,max_features=80000)
trn_term_doc
trn_term_doc[5] #83 stored elements
w0 = set([o.lower() for o in trn[5].split(' ')]); w0
len(w0)
vocab = loaded_vectorizer.get_feature_names()
print(len(vocab))
vocab[5000:5005]

This is where I get the error:

NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
vocab = loaded_vectorizer.get_feature_names()

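loaded_vectorizer is not defined anywhere in this code, so it is no surprise that it has not been fitted. Only veczr was fitted (by fit_transform), so the vocabulary has to be read from veczr, or from an object that was fitted before it was saved and loaded.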
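A minimal sketch of the difference, using two toy documents instead of the IMDB reviews. The `docs` list, the `feature_names` helper, and the pickle round-trip are assumptions for illustration: the name `loaded_vectorizer` suggests an object restored from disk, but the question does not show that step. The helper only exists because newer scikit-learn releases expose `get_feature_names_out`, while older ones have `get_feature_names`.

```python
import pickle

from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer


def feature_names(vectorizer):
    """Hypothetical helper: return the vocabulary as a list on old and new scikit-learn."""
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


docs = ["a great movie", "a terrible movie"]  # toy stand-in for trn

# A vectorizer that was never fitted has no vocabulary to report.
unfitted = CountVectorizer()
try:
    feature_names(unfitted)
    error_raised = False
except NotFittedError:
    error_raised = True

# Fitting builds the vocabulary, so the same call now succeeds.
veczr = CountVectorizer()
veczr.fit_transform(docs)
vocab = feature_names(veczr)

# A vectorizer that is pickled *after* fitting keeps its vocabulary.
loaded_vectorizer = pickle.loads(pickle.dumps(veczr))
loaded_vocab = feature_names(loaded_vectorizer)

print(error_raised, vocab, loaded_vocab == vocab)
```

If the vectorizer really is loaded from a file, the thing to check is that it was saved after `fit` or `fit_transform`, not before.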

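Also, why do you initialize veczr twice? It looks like you never use the second one. Note that the second assignment replaces the fitted vectorizer with a new, unfitted one, so any later call on veczr would raise the same error.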
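One way to avoid the double initialization is to pass every option in a single constructor call and fit that object once. The sketch below assumes toy `trn` and `val` lists and uses the default tokenizer for brevity; with the real data, `tokenizer=tokenize` would be passed in the same call.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the training and test reviews (assumption, not the IMDB data).
trn = ["a great movie", "a terrible movie", "great acting and a great plot"]
val = ["a terrible plot"]

# Configure the vectorizer once, before fitting.
veczr = CountVectorizer(ngram_range=(1, 3), min_df=1, max_features=80000)

# fit_transform learns the vocabulary from the training texts only;
# transform reuses that vocabulary for the test texts.
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

# Both matrices share the same columns, one per vocabulary entry.
print(trn_term_doc.shape, val_term_doc.shape)
print(len(veczr.vocabulary_))
```

Because veczr is not reassigned after fitting, the vocabulary stays available for later inspection.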