Keep a model made with TF-IDF for predicting new content using scikit-learn in Python
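This is a sentiment analysis model that uses TF-IDF for feature extraction.
I want to know how to save this model and reuse it.
I tried saving it the way shown here, but when I load it, apply the same preprocessing to the test text, and call fit_transform on it, I get an error saying the model expected X features but got Y.
This is how I saved it: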
filename = "model.joblib"
joblib.dump(model, filename)
Here is the code for my TF-IDF model:
import joblib
import pandas as pd
import re
import nltk
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')
from nltk.corpus import stopwords
processed_text = ['List of pre-processed text']
y = ['List of labels']
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(processed_text).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
text_classifier = BernoulliNB()
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
Edit:
Just to show exactly where each line goes.
So after:
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
add:
tfidf_obj = tfidfconverter.fit(processed_text)  # this fitted object is what will be reused
joblib.dump(tfidf_obj, 'tf-idf.joblib')
Then carry on with the remaining steps. You save the classifier after training, so after:
text_classifier.fit(X_train, y_train)
put:
joblib.dump(text_classifier, "classifier.joblib")
Now, whenever you want to predict on any text:
tf_idf_converter = joblib.load("tf-idf.joblib")
classifier = joblib.load("classifier.joblib")
Take the list of sentences you want to predict:
sent = []
classifier.predict(tf_idf_converter.transform(sent))
This gives you a list with the sentiment of each sentence.
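To make the ordering of those steps concrete, here is a minimal end-to-end sketch of the save and reload round trip. The toy corpus, the labels, the temporary directory, and the default TfidfVectorizer settings are all assumptions made for illustration; they are not the asker's data or parameters.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy training data (hypothetical, for illustration only).
train_texts = [
    "good movie great acting",
    "great plot good fun",
    "bad movie terrible acting",
    "terrible plot bad pacing",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Fit the vectorizer once, on the training text only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Train the classifier on the TF-IDF features.
classifier = BernoulliNB()
classifier.fit(X_train, train_labels)

# Persist BOTH fitted objects; the classifier alone is not enough.
save_dir = tempfile.mkdtemp()
vectorizer_path = os.path.join(save_dir, "tf-idf.joblib")
classifier_path = os.path.join(save_dir, "classifier.joblib")
joblib.dump(vectorizer, vectorizer_path)
joblib.dump(classifier, classifier_path)

# Later, possibly in another process: load both and call transform only.
loaded_vectorizer = joblib.load(vectorizer_path)
loaded_classifier = joblib.load(classifier_path)

new_texts = ["great fun", "terrible acting"]
X_new = loaded_vectorizer.transform(new_texts)
predicted = loaded_classifier.predict(X_new)
print(list(predicted))
```

Because the loaded vectorizer carries the vocabulary learned during training, X_new has the same number of columns as X_train, which is what the classifier expects.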
You can first fit the tfidf on your training set using:
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
tfidf_obj = tfidfconverter.fit(processed_text)
Then find a way to store tfidf_obj, for instance with pickle or joblib, e.g.:
joblib.dump(tfidf_obj, filename)
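Then load the saved tfidf_obj and apply only transform on your test set. Calling fit or fit_transform again would learn a new vocabulary from the test text, which is what causes the feature-count mismatch: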
loaded_tfidf = joblib.load(filename)
test_new = loaded_tfidf.transform(X_test)
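As a small illustration of why the error appears, the sketch below compares the number of features produced by transform on a fitted vectorizer with the number produced by a fresh fit_transform on new text. The example sentences and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and new texts, for illustration only.
train_texts = ["good movie great acting", "bad movie terrible acting"]
new_texts = ["great fun"]

# Vocabulary is learned here, from the training text.
fitted = TfidfVectorizer().fit(train_texts)
n_train_features = fitted.transform(train_texts).shape[1]

# transform() reuses the training vocabulary, so the width is unchanged.
n_transform_features = fitted.transform(new_texts).shape[1]

# fit_transform() on the new text learns a different, smaller vocabulary.
n_refit_features = TfidfVectorizer().fit_transform(new_texts).shape[1]

print(n_train_features, n_transform_features, n_refit_features)
```

A classifier trained on n_train_features columns accepts the output of transform, but not the output of the refit, since the column count (and the meaning of each column) has changed.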