Prediction error when training and evaluating prediction models

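I'm trying to train and evaluate a prediction model on a dataset I found on Kaggle, but I get a precision of 0 and I wonder whether I'm doing something wrong.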

The code works with a random forest model, but not with an SVM or a neural network.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
%matplotlib inline

#loading dataset
recipes = pd.read_csv('epi_r.csv')

keep_col = ['rating','calories','protein','fat','sodium']
recipes = recipes[keep_col]
recipes = recipes.dropna()

#preprocessing data
bins = (-1, 4, 5)
group_names = ['bad','good']
recipes['rating'] = pd.cut(recipes['rating'].dropna(), bins=bins, labels=group_names)
recipes['rating'].unique()

#bad=0; good=1
label_rating = LabelEncoder()

recipes['rating'] = label_rating.fit_transform(recipes['rating'].astype(str))

#separate dataset as response variable and feature variables
x = recipes.drop('rating', axis=1)
y = recipes['rating']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

#converts the values & levels the playing fields
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
#don't fit again b/c want to use the same fit
x_test = sc.transform(x_test)

clf=svm.SVC()
clf.fit(x_train,y_train)
pred_clf = clf.predict(x_test)

print(classification_report(y_test, pred_clf))
print(confusion_matrix(y_test, pred_clf))



              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1465
           1       0.54      1.00      0.70      1708

   micro avg       0.54      0.54      0.54      3173
   macro avg       0.27      0.50      0.35      3173
weighted avg       0.29      0.54      0.38      3173

[[   0 1465]
 [   0 1708]]

/usr/local/lib/python3.7/site-packages/sklearn/metrics/classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

This is the result I get; nothing seems to be predicted correctly.

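A recall of 1.0 for class 1, together with a recall of 0.0 for class 0, means that your model always predicts "1". You can also see this in the confusion matrix: all 1708 samples of class 1 are predicted correctly, but all 1465 samples of class 0 are predicted as class 1.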
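As a quick check, the per-class numbers in the report can be recomputed by hand from the confusion matrix. The snippet below is only a sketch: the matrix is copied from the output in the question, and the helper names `recall` and `precision` are made up for illustration (they are not scikit-learn functions).

```python
# Confusion matrix copied from the question's output:
# rows are true classes, columns are predicted classes.
cm = [[0, 1465],
      [0, 1708]]


def recall(cm, k):
    """Fraction of true class-k samples that were predicted as k."""
    row_total = sum(cm[k])
    return cm[k][k] / row_total if row_total else 0.0


def precision(cm, k):
    """Fraction of class-k predictions that were right (0.0 if k was never predicted)."""
    col_total = sum(row[k] for row in cm)
    return cm[k][k] / col_total if col_total else 0.0


# How often each class was predicted at all (column sums).
predicted_counts = [sum(row[k] for row in cm) for k in range(len(cm))]

for k in range(len(cm)):
    print(f"class {k}: precision={precision(cm, k):.2f} "
          f"recall={recall(cm, k):.2f} predicted {predicted_counts[k]} times")
```

Class 0 is never predicted, so its precision has a zero denominator; that is the situation the UndefinedMetricWarning refers to.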

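A model that always predicts a single value is a common problem: it is stuck in some sub-optimal solution. You might have luck with normalizing the input values (so that one column does not dominate), using a different type of model (for example a different kernel), or even choosing a different random seed.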

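You just haven't found the right parameters. For example, in your case the default settings fit the data poorly, so the model falls back to predicting the majority class. You should try GridSearchCV to find the best parameters for your dataset (especially kernel, C, and gamma).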
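A minimal sketch of such a search might look like the following. It runs on synthetic data from `make_classification` as a stand-in (an assumption, so that the snippet is self-contained); with the recipes data you would pass the unscaled training features and labels instead. The grid values are illustrative, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); replace with your own features and labels.
X, y = make_classification(n_samples=300, n_features=4, random_state=42)

# Keeping the scaler inside the pipeline means it is re-fit on the
# training part of every cross-validation fold.
pipe = make_pipeline(StandardScaler(), SVC())

# Illustrative grid over the parameters mentioned above.
param_grid = {
    "svc__kernel": ["rbf", "sigmoid"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the pipeline refit on all the data passed to `fit`, and it can be evaluated on the held-out test set.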

I played with your dataset a bit and tried the following change:

clf=SVC(kernel='sigmoid', C=10, verbose=True)
clf.fit(x_train,y_train)
pred_clf = clf.predict(x_test)
print(pred_clf)
print(classification_report(y_test, pred_clf))
print(confusion_matrix(y_test, pred_clf))

Output:

......
Warning: using -h 0 may be faster
*
optimization finished, #iter = 6651
obj = -196704.694272, rho = 33.691873
nSV = 9068, nBSV = 9068
Total nSV = 9068
[LibSVM][0 1 1 ... 0 1 0]
              precision    recall  f1-score   support

           0       0.49      0.58      0.53      1465
           1       0.58      0.49      0.53      1708

    accuracy                           0.53      3173
   macro avg       0.53      0.53      0.53      3173
weighted avg       0.54      0.53      0.53      3173

[[843 622]
 [864 844]]

The results are not great, but the model no longer predicts all ones.

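In summary, do the following:

  1. Always try cross-validation to find a good set of parameters for your dataset.
  2. Turn on the estimator's verbose option. It gives you important clues about what is going on.
  3. Always try visualizing the data first and start with simpler algorithms. For example, I would first try to understand whether the data is linearly separable and try logistic regression, and only then move on to SVMs or ensembles, which are always harder to tune.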
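The third point could be sketched roughly as below: compare a majority-class baseline with a plain logistic regression under cross-validation before reaching for an SVM. The data is synthetic (an assumption, generated with `make_classification` so the snippet runs on its own), and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, well-separated stand-in data (assumption); swap in your own X, y.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# Baseline that always predicts the most frequent class,
# which is what the SVC in the question ended up doing.
dummy_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=5
)

# Simple linear model; doing well here suggests the classes are
# roughly linearly separable.
logreg = make_pipeline(StandardScaler(), LogisticRegression())
logreg_scores = cross_val_score(logreg, X, y, cv=5)

print("baseline accuracy:", dummy_scores.mean())
print("logistic regression accuracy:", logreg_scores.mean())
```

If a more complex model cannot clearly beat both numbers on your real data, the features may simply carry little signal, and more tuning is unlikely to help.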