Data pre-processing for sklearn's SVM without using the train_test_split method

Question

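I have used Inception to generate 1000 features (probabilities of objects) for about 11,000 videos. These videos are already categorized by genre, and I want an SVM to predict which genre a video belongs to.

I want to apply an SVM to these feature vectors, but every tutorial I have read so far uses the train_test_split method from sklearn.model_selection.

What my data looks like:

I have split my dataset into two csv files, with ~9000 training and ~2000 testing records (each with 1000 features). The format is videoId,feature1,feature2,...,feature1000
I have files titled by genre, e.g. Training/education.txt for training and Testing/education.txt for testing. Each file contains the videoIds that belong to that genre.

I am new to data science and to libraries like pandas and sklearn, so I am at a loss as to how I should prepare this data. I have been following this guide: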

import pandas as pd  

bankdata = pd.read_csv("D:/Datasets/bill_authentication.csv")  
X = bankdata.drop('Class', axis=1)  
y = bankdata['Class']  
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

How would I go about getting X_train, X_test, y_train and y_test? Because of the way my data is currently set up, I can't use a method like train_test_split.

Answer 1

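The reason all tutorials advise you to use train_test_split from sklearn.model_selection is that they assume you will want to evaluate your learning model's performance, and possibly tune its hyperparameters, before you finally use it to generate predictions on your test dataset.

This practice is known as holding out a validation set (often loosely called a "cross-validation" set). To do so, you keep the test set untouched for the time being and split off about 20% of the rows of the training set. You train your model on the other 80% of the training rows and use it to generate predictions on the held-out 20%.

You can pick a metric, such as accuracy, to judge the performance of your model. Oftentimes, it's at this point that you will want to experiment with different values for your model's hyperparameters and see whether its score on the validation set (the held-out 20% of the training set) improves.

The train_test_split method is simply an easy way to split your training data into these 80/20 portions. I recommend that you not skip this step. The reason is that if you change your model or its hyperparameters after observing how it performs on the actual test set, you lose any basis for knowing how it would perform on brand-new, real-world data.

This is called "overfitting to the test set", a common mistake in practice that yields machine learning models that do very well on previously collected data, yet (much to their creators' surprise) end up performing poorly on the real-world data they see once they are put into production.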

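To summarize, the idea is that you:

1. Train on 80% of your training data.
2. Evaluate on the remaining 20% of your training data.
3. Change your model until you are satisfied with its score on the data used in step (2.).
4. Finally, and only at the very end, use your model to make predictions on your actual test data.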

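As an aside, Sklearn's naming of the train_test_split method is somewhat confusing, because the purpose of the method here is to create a validation set. (train_val_split seems to me a more intuitive name...)

Here are the steps in code that I imagine you'd follow for your particular situation (data split across multiple .txt files):

1. Import modules and all the training .csv files: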

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Assumes each genre file holds that genre's rows (videoID plus feature columns)
X_edu = pd.read_csv('Training/education.txt')
X_hor = pd.read_csv('Training/horror.txt')
...

2. Create a Genre column in each genre's dataframe, then concatenate all of them into a single dataframe:

train_dfs = [X_edu, X_hor, ...]
genres = ['edu', 'hor', ...]
for i, genre in enumerate(genres):
    train_dfs[i]['Genre'] = genre

X = pd.concat(train_dfs).reset_index(drop=True)  # reset the index so each row has a unique index;
                                                 # important so each row can be matched properly with its label

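3. Extract the labels from the training data (I'm assuming the labels sit in a column titled Genre or something similar) and drop the videoID column (since it doesn't look like a predictive feature):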

y = X['Genre']
X = X.drop(['Genre', 'videoID'], axis=1)
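The snippets above read each genre file as if it already contained the feature columns. In the question's layout, the features sit in one CSV per split and each genre file lists only videoIds, so the labels have to be attached by videoId first. Here is a minimal sketch under these assumptions: the features CSV has a header row with a `videoId` column, each genre file has one ID per line with no header, and each video belongs to exactly one genre. The helper name `attach_genres` and the in-memory sample data are hypothetical, used only so that the example runs on its own.

```python
import io

import pandas as pd


def attach_genres(features, genre_files):
    """Return `features` with a 'Genre' column built from per-genre ID lists.

    features: DataFrame with a 'videoId' column (assumed column name).
    genre_files: dict mapping a genre label to a path or file-like object
                 that holds one videoId per line, without a header.
    """
    label_frames = []
    for genre, source in genre_files.items():
        ids = pd.read_csv(source, header=None, names=['videoId'], dtype=str)
        ids['Genre'] = genre
        label_frames.append(ids)
    labels = pd.concat(label_frames, ignore_index=True)
    # An inner join keeps only videos that have both features and a label.
    return features.merge(labels, on='videoId', how='inner')


# Tiny in-memory stand-ins for the real files (illustrative data only).
features_csv = io.StringIO(
    "videoId,feature1,feature2\n"
    "a1,0.1,0.9\n"
    "b2,0.8,0.2\n"
    "c3,0.5,0.5\n"
)
features = pd.read_csv(features_csv, dtype={'videoId': str})
genre_files = {
    'education': io.StringIO("a1\nc3\n"),
    'horror': io.StringIO("b2\n"),
}

labelled = attach_genres(features, genre_files)
y = labelled['Genre']
X = labelled.drop(['Genre', 'videoId'], axis=1)
```

With real files, the dictionary values would be paths such as 'Training/education.txt', and the same helper could be applied to the test CSV together with the files under Testing/.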

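4. Use train_test_split to create your training and validation sets (a nice bonus: train_test_split shuffles the rows of the whole training dataframe before splitting, so it is unlikely that a genre ends up entirely missing from your validation set):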

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.20)
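Shuffling makes it unlikely, but does not guarantee, that every genre shows up in the validation set. If that matters, `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both parts. Here is a small sketch on made-up labels (the toy data and `random_state=0` are assumptions added for reproducibility):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced data: 90 'edu' rows and 10 'hor' rows (illustrative only).
X = pd.DataFrame({'feature1': np.arange(100)})
y = pd.Series(['edu'] * 90 + ['hor'] * 10)

# stratify=y asks for the same genre proportions in both parts of the split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

print(y_val.value_counts())
```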

5. Fit your model on X_train and make predictions on X_val:

clf = SVC()
clf.fit(X_train, y_train)
preds = clf.predict(X_val)

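6. Determine how well your model did on the validation-set predictions (I use accuracy here, but you can use any metric you like; Sklearn probably has a function for whichever metric you want to use):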

val_acc = accuracy_score(y_val, preds)
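Accuracy alone can hide weak genres when the classes are imbalanced. As an illustration of swapping in other metrics, the sketch below computes macro-averaged F1 and a confusion matrix with `sklearn.metrics`; the label lists are made up for the example.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up true labels and predictions, standing in for y_val and preds.
y_val = ['edu', 'edu', 'edu', 'hor', 'hor', 'edu']
preds = ['edu', 'edu', 'hor', 'hor', 'edu', 'edu']

val_acc = accuracy_score(y_val, preds)
# Macro averaging weights every genre equally, regardless of its size.
val_f1 = f1_score(y_val, preds, average='macro')
# Rows are true genres, columns are predicted genres, in the order of `labels`.
cm = confusion_matrix(y_val, preds, labels=['edu', 'hor'])

print(val_acc, val_f1)
print(cm)
```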

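7. Experiment with different values for your SVM learner's hyperparameters and repeat steps (5.) and (6.) above. Once you are satisfied with your model's performance, it is time to generate predictions on your actual test data.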
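One way to organise that experimentation is a small loop over candidate settings, scoring each one on the validation set only. The sketch below uses synthetic data from `make_classification` and an arbitrary grid of `C` and `kernel` values; both are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the real training features and genre labels.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Try each combination and record its validation accuracy.
results = {}
for kernel in ['linear', 'rbf']:
    for C in [0.1, 1.0, 10.0]:
        clf = SVC(kernel=kernel, C=C)
        clf.fit(X_train, y_train)
        results[(kernel, C)] = accuracy_score(y_val, clf.predict(X_val))

# Keep the setting with the highest validation accuracy and refit it.
best_params = max(results, key=results.get)
best_clf = SVC(kernel=best_params[0], C=best_params[1]).fit(X_train, y_train)

print(best_params, results[best_params])
```

The test set plays no part in this loop; it is only touched once the final setting has been chosen.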

8. Load each genre's test .csv file and assemble them into a single dataframe, just as you did for the training data above:

test_edu = pd.read_csv('Testing/education.txt')
test_hor = pd.read_csv('Testing/horror.txt')
...

test_dfs = [test_edu, test_hor, ...]
for i, genre in enumerate(genres):
    test_dfs[i]['Genre'] = genre

test = pd.concat(test_dfs).reset_index(drop=True)  # reset the index so each row has a unique index
y_test = test['Genre']
X_test = test.drop(['Genre', 'videoID'], axis=1)
test_preds = clf.predict(X_test)
test_acc = accuracy_score(y_test, test_preds)

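This test set accuracy score should give you the most realistic estimate of how your model would fare if it were asked to make predictions on brand-new videos it has never seen before.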

svm

pandas

scikit-learn