Reduce pickle size of a TfidfVectorizer

Question

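I need to standardize some parameters for building text-based vectors. That is why I am trying to pickle a TfidfVectorizer fitted on a set of text documents. Based on these parameters, I need to vectorize new text documents so that their features and weighting scheme are the same as for the earlier documents.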

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(
        strip_accents = 'ascii', sublinear_tf=True, min_df=5, norm='l2',
        encoding='latin-1', ngram_range=(1, 2), stop_words=spanish_stopwords,
        token_pattern = r'\w+[a-z,ñ]')
features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()

features.shape

(617, 22997)

import pickle
# Use a context manager so the file is flushed and closed after the dump
with open("vectorizer3.pickle", "wb") as f:
    pickle.dump(tfidf, f)

vectorizer3.pickle is 76.2 MB. Is there any way to reduce it to 10 MB?

Answer 1

Try using gzip:

import gzip
import pickle

# Write the pickle through gzip. Compression can make this slower than a plain dump.
with gzip.open('tfidf.data', 'wb') as fp:
    pickle.dump(tfidf, fp)

# Read it back. This assumes tfidf.data was written with gzip as above.
with gzip.open('tfidf.data', 'rb') as fp:
    tfidf = pickle.load(fp)

This method does not guarantee that the file size will drop below 10 MB, but it will definitely reduce the size of the pickle file.
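
To see how much gzip actually saves before relying on it, the plain and compressed sizes can be compared directly. The following is a minimal sketch that uses only the standard library. The helper names `dump_gzip`, `load_gzip`, and `pickle_sizes` are hypothetical, and the `stand_in` dictionary is a made-up placeholder for the fitted `tfidf` object, so the numbers it prints say nothing about the real vectorizer.

```python
import gzip
import os
import pickle
import tempfile


def dump_gzip(obj, path):
    """Pickle obj into a gzip-compressed file at path."""
    with gzip.open(path, "wb") as fp:
        pickle.dump(obj, fp, protocol=pickle.HIGHEST_PROTOCOL)


def load_gzip(path):
    """Load an object that was written with dump_gzip."""
    with gzip.open(path, "rb") as fp:
        return pickle.load(fp)


def pickle_sizes(obj):
    """Return (plain_bytes, gzip_bytes) for obj, measured on temporary files."""
    with tempfile.TemporaryDirectory() as tmp:
        plain_path = os.path.join(tmp, "obj.pickle")
        gzip_path = os.path.join(tmp, "obj.pickle.gz")
        with open(plain_path, "wb") as fp:
            pickle.dump(obj, fp, protocol=pickle.HIGHEST_PROTOCOL)
        dump_gzip(obj, gzip_path)
        return os.path.getsize(plain_path), os.path.getsize(gzip_path)


# Hypothetical stand-in for the fitted vectorizer: a mapping of n-gram strings
# to column indices. Replace it with the real tfidf object to measure that.
stand_in = {f"palabra{i} termino{i % 50}": i for i in range(20000)}

plain_bytes, gzip_bytes = pickle_sizes(stand_in)
print(f"plain: {plain_bytes} bytes, gzip: {gzip_bytes} bytes")
```

Calling `pickle_sizes(tfidf)` on the real object would show whether gzip alone gets close to the 10 MB target. How much it saves depends on how repetitive the stored strings are.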
