Retraining a Persistent SVM Model with New Data in Scikit-Learn (Python 3)

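I'm developing a machine learning program in Python using Scikit-Learn that classifies emails into issue types based on their contents. For example: someone emails me saying "This program is not launching", and the machine classifies it as "Crash Issue".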

I'm using an SVM algorithm that reads the email contents and their respective category labels from 2 CSV files. I have written two programs:

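  1. The first program trains the machine and exports the trained model using joblib.dump() so that the second program can use it.
  2. The second program makes predictions by importing the trained model. I would like the second program to be able to update the trained model by refitting the classifier with the new data it receives, but I'm not sure how to accomplish this. The prediction program asks the user to type in an email, then makes a prediction. It then asks the user whether its prediction was correct. In both cases, I would like the machine to learn from the result.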

Training program:

import numpy as np
import pandas as pd
from pandas import DataFrame
import os
from sklearn import svm
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib #sklearn.externals.joblib was removed in scikit-learn 0.23; joblib is now imported directly


###### Extract and Vectorize the features from each email in the Training Data ######
features_file = "features.csv" #The CSV file that contains the descriptions of each email. Features will be extracted from this text data
features_df = pd.read_csv(features_file, encoding='ISO-8859-1') 
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(features_df['Description'].values.astype('U')) #The sole column in the CSV file is labeled "Description", so we specify that here


###### Encode the class Labels of the Training Data ######
labels_file = "labels.csv" #The CSV file that contains the classification labels for each email
labels_df = pd.read_csv(labels_file, encoding='ISO-8859-1')
lab_enc = preprocessing.LabelEncoder()
labels = lab_enc.fit_transform(labels_df.values.ravel()) #LabelEncoder expects a 1-D array, so flatten the single-column DataFrame


###### Create a classifier and fit it to our Training Data ######
clf = svm.SVC(gamma=0.01, C=100)
clf.fit(features, labels)


###### Output persistent model files ######
joblib.dump(clf, 'brain.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')
joblib.dump(lab_enc, 'lab_enc.pkl')
print("Training completed.")

Prediction program:

import numpy as np
import os
from sklearn import svm
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib #sklearn.externals.joblib was removed in scikit-learn 0.23; joblib is now imported directly


###### Load our model from our training program ######
clf = joblib.load('brain.pkl')
vectorizer = joblib.load('vectorizer.pkl')
lab_enc = joblib.load('lab_enc.pkl')


###### Prompt user for input, then make a prediction ######
print("Type an email's contents here and I will predict its category")
newData = [input(">> ")]
newDataFeatures = vectorizer.transform(newData)
print("I predict the category is: ", lab_enc.inverse_transform(clf.predict(newDataFeatures)))


###### Feedback loop - Tell the machine whether or not it was correct, and have it learn from the response ######
print("Was my prediction correct? y/n")
feedback = input(">> ")

#Keep asking until the user answers "y" or "n"
while feedback not in ("y", "n"):
    print("Response not understood. Was my prediction correct? y/n")
    feedback = input(">> ")

if feedback == "y":
    print("I was correct. I'll incorporate this new data into my persistent model to aid in future predictions.")
    #refit the classifier using the new features and label
elif feedback == "n":
    print("I was incorrect. What was the correct category?")
    correctAnswer = input(">> ")
    print("Got it. I'll incorporate this new data into my persistent model to aid in future predictions.")
    #refit the classifier using the new features and label

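From the reading I've done, I understand that SVMs don't really support incremental learning, so I figure I need to merge the new data into the old training data and retrain the whole model from scratch every time I have new data to add. That's fine, but I'm not sure how to actually implement it. Do I need the prediction program to update the two CSV files to include the new data so that training can start over?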

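I eventually figured out that the conceptual answer to my question was that I needed to update the CSV files I originally used to train the machine. After receiving feedback, I simply write the new features and labels to their respective CSV files, and can then retrain the machine with the new information included in the training data set.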
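For anyone landing here later, this is roughly what the "write the new features and labels to the CSV files" step could look like. It is only a minimal sketch: the helper names `append_row` and `record_feedback` are made up for illustration, and it assumes both CSV files hold a single column and use the same ISO-8859-1 encoding that the training program passes to read_csv.

```python
import csv
import os

ENCODING = "ISO-8859-1"  # assumed to match the encoding used by the training program


def append_row(csv_path, value):
    """Append one single-column row to an existing CSV file (hypothetical helper)."""
    # If the file does not end with a line break, add one first so the new
    # row does not get glued onto the last existing row.
    needs_newline = False
    if os.path.exists(csv_path) and os.path.getsize(csv_path) > 0:
        with open(csv_path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) not in (b"\n", b"\r")

    # errors="replace" swaps characters that ISO-8859-1 cannot represent for "?"
    with open(csv_path, "a", newline="", encoding=ENCODING, errors="replace") as f:
        if needs_newline:
            f.write("\n")
        # csv.writer quotes text that contains commas, quotes, or line breaks
        csv.writer(f, lineterminator="\n").writerow([value])


def record_feedback(email_text, label,
                    features_file="features.csv", labels_file="labels.csv"):
    """Store one confirmed (email, label) pair so the next training run sees it."""
    append_row(features_file, email_text)
    append_row(labels_file, label)
```

In the prediction program, the "y" branch would call something like record_feedback(newData[0], lab_enc.inverse_transform(clf.predict(newDataFeatures))[0]) and the "n" branch record_feedback(newData[0], correctAnswer). Rerunning the training program afterwards refits the vectorizer, the label encoder, and the classifier on the enlarged data set, which also picks up any new words or categories.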