Examine data after building a machine learning model in Python

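I built a sentiment analysis model for Arabic in Python. Now that the model is built, how can I test it on external data, and how should I write the code for that?

When fitting the model I extracted features with TF-IDF. My problem is that, after training the model on the training and test data, I do not know how to handle external data when I want to test on it.

Summary: I trained the model and reached 88% accuracy, and now I want to write code that tests the model on external data.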

# train = 3461 records
# test = 61 records
# combi = train + test, so that TF-IDF can be applied to both

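import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: `stop` (used in remove_stop below) is NLTK's Arabic stop-word list;
# the original post does not show how it was defined.
# Requires a one-time nltk.download('stopwords').
stop = set(stopwords.words('arabic'))

train = pd.read_excel('Final_train.xlsx')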

test = pd.read_excel('Testing.xlsx' , usecols=['Tweet'])


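# Merge train and test so every preprocessing function is applied to both
def combine(tr, te):
    global combi
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    combi = pd.concat([tr, te], ignore_index=True)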


# Remove all stop words, based on NLTK (Natural Language Toolkit)

def remove_stop(combi1):
    combi1['Tweet'] = combi1['Tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
    return combi1


# Return the Arabic root of each token, using the ISRI stemmer
# (provided by the University of Nevada, Las Vegas, USA)

def steeming(combi2):
    st = ISRIStemmer()
    combi2['Tweet'] = combi2['Tweet'].apply(lambda x: " ".join([st.stem(word) for word in x.split()])) 
    return combi2


# This list contains words unrelated to our domain that must be removed from the dataset
# so the ML model works properly and accurately. I built it manually to improve the model's accuracy.

def remove_unneded_word(combi3):
    unword =  pd.read_excel('Final_train & MCSA/Un_neededword.xlsx')
    unword = unword.squeeze()
    unword = list(unword)
    combi3['Tweet'] = combi3['Tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in unword))
    return combi3

# Call each function in turn
combine(train,test)
combi = remove_stop(combi)
combi = steeming(combi)
combi = remove_unneded_word(combi)



#                  ______________________________________________
#                 | Term Frequency–Inverse Document Frequency    |
#
#   To Represent Each word as matrix of numbers 

tfidf_vectorizer = TfidfVectorizer(max_df=0.8,min_df=5, max_features=1600)
# TF-IDF feature matrix
tfidf = tfidf_vectorizer.fit_transform(combi['Tweet'])




from sklearn import svm  # scikit-learn's support vector machine implementation
from sklearn.model_selection import train_test_split
from sklearn import metrics


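# Training rows come first in the combined TF-IDF matrix, test rows after them.
# Using len(train) avoids a hard-coded row count that can drift from the data.
n_train = len(train)
train_bow = tfidf[:n_train, :]
test_bow = tfidf[n_train:, :]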

# splitting data into training and validation set
Tr_D_bow, Te_D_bow, Tr_L_bow , Te_L_bow = train_test_split(train_bow, train['Class'], random_state=45, test_size=0.2)
# Create SVM classifier object
SVM = svm.SVC()
# Train SVM classifier
SVM = SVM.fit(Tr_D_bow,Tr_L_bow)

#Predict the response for test dataset
SVM_pre = SVM.predict(Te_D_bow)



print('The Accuracy of SVMC is -->',metrics.accuracy_score(Te_L_bow, SVM_pre))

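To test the model (that is, to use it to make predictions on unseen data), call .transform on the fitted vectorizer and .predict on the trained classifier (with sklearn).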

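In your code you kept the preprocessing functions separate from the model training, which is good. Combining the test and training data, however, is not. The test data should be your "external data". TF-IDF can still be applied as a preprocessing step to both the test and the training data, but tfidf_vectorizer.fit_transform should be applied only to the training data; the test data should only go through tfidf_vectorizer.transform. If you feed all of your data into TF-IDF, you will not know how the model behaves when the input contains words it has never seen.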
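To make the point about unseen words concrete, here is a minimal sketch. The sentences are made-up English placeholders, not the Arabic tweets from the question, and the vectorizer uses default settings. It fits TF-IDF on the training texts only and then transforms texts the vectorizer has never seen:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the training tweets and external tweets.
train_texts = ["good movie", "bad movie", "good food"]
unseen_texts = ["good unknownword", "completely new words"]

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights from the training texts.
train_matrix = vectorizer.fit_transform(train_texts)

# transform reuses that vocabulary; words outside it are silently dropped.
unseen_matrix = vectorizer.transform(unseen_texts)

vocabulary = sorted(vectorizer.vocabulary_)
nonzero_per_row = unseen_matrix.getnnz(axis=1)

print(vocabulary)
print(train_matrix.shape, unseen_matrix.shape)
print(nonzero_per_row)
```

Both matrices have the same number of columns, and a text made up entirely of unknown words becomes an all-zero row. This is the behaviour that stays hidden when the vectorizer is fitted on the training and test data together.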

With sklearn I usually organize my code as follows:

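import pickle

import pandas as pd
from sklearn import metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Read data
train = pd.read_excel('Final_train.xlsx')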

# Split the data into training and validation sets
x_train, x_valid, y_train, y_valid = train_test_split(
    train, train['Class'], random_state=45, test_size=0.2)

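# Define preprocessing functions
# (remove_stop, steeming and remove_unneded_word are the functions defined in the question)
def preprocess(data):
    data = data.copy()  # avoid modifying a slice of the original DataFrame
    data = remove_stop(data)
    data = steeming(data)
    data = remove_unneded_word(data)
    return data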

# Define sklearn pipeline
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.8,min_df=5, max_features=1600)
classifier = svm.SVC()

# Train classifier
x_train_preproc = preprocess(x_train)
# Fit TF-IDF on the training tweets only
x_train_bow = tfidf_vectorizer.fit_transform(x_train_preproc['Tweet'])
classifier.fit(x_train_bow, y_train)

# Test pipeline
x_valid_preproc = preprocess(x_valid)
x_valid_bow = tfidf_vectorizer.transform(x_valid_preproc['Tweet'])
pred_valid = classifier.predict(x_valid_bow)
print('The Accuracy of SVMC is -->',
      metrics.accuracy_score(y_valid, pred_valid))

# Save model
# from https://medium.datadriveninvestor.com/machine-learning-how-to-save-and-load-scikit-learn-models-d7b99bc32c27
with open('trained_tfidf.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)
with open('trained_classifier.pkl', 'wb') as f:
    pickle.dump(classifier, f)

# Load model
with open('trained_tfidf.pkl', 'rb') as f:
    tfidf_vectorizer = pickle.load(f)
with open('trained_classifier.pkl', 'rb') as f:
    classifier = pickle.load(f)

# Predict on unseen data
# Note that the test data have not been loaded until now!
test = pd.read_excel('Testing.xlsx', usecols=['Tweet'])
x_test_preproc = preprocess(test)
x_test_bow = tfidf_vectorizer.transform(x_test_preproc['Tweet'])
pred_test = classifier.predict(x_test_bow)
print(pred_test)

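The main goal is to make sure the test data are used only after everything has been trained, so that they measure how well the model generalizes to unseen data. To make this easier, you can combine the TfidfVectorizer and the SVC into a pipeline. Note that the code I wrote has not been tested; it is only meant to show the general steps.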
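As a rough illustration of the pipeline suggestion, here is a sketch that assumes sklearn's Pipeline class. The texts and labels are invented placeholders, and the vectorizer uses default settings because min_df=5 would leave no terms on such a tiny corpus. In the real code the preprocessed 'Tweet' column and 'Class' column would take their place:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for the preprocessed training tweets.
train_texts = [
    "great food", "lovely service", "great place",
    "awful food", "terrible service", "awful place",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# One object holds both steps: fit() fits TF-IDF and then the SVM,
# and predict() applies transform() before classifying.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),  # defaults, because the toy corpus is tiny
    ("svm", SVC()),
])
model.fit(train_texts, train_labels)

# A single pickle now covers the vectorizer and the classifier.
restored = pickle.loads(pickle.dumps(model))

# External data enters only at this point, as raw (preprocessed) text.
external_texts = ["great service", "terrible food"]
predictions = restored.predict(external_texts)
print(predictions)
```

With this layout there is no separate vectorizer to keep in sync with the classifier, which removes the risk of calling fit_transform on external data by mistake.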