Questions on ensemble technique in machine learning

I am studying ensemble machine learning, and while reading some articles online I ran into two questions.

This article says:

Instead, model 2 may have a better overall performance on all the data points, but it has worse performance on the very set of points where model 1 is better. The idea is to combine these two models where they perform the best. This is why creating out-of-sample predictions have a higher chance of capturing distinct regions where each model performs the best.

But I still don't get the point: why does not training on all of the training data avoid this problem?

In the prediction section, the same article says:

Simply, for a given input data point, all we need to do is to pass it through the M base-learners and get M number of predictions, and send those M predictions through the meta-learner as inputs

But during training we used k-fold train data to train the M base-learners, so should I also train the M base-learners on all of the train data before using them to predict on a new input?

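The idea behind ensembling is that a group of weak predictors can outperform a single strong predictor. So if we train different models that produce different predictions and use majority voting as the final result of the ensemble, that result is often better than what we would get by training only one model. For example, suppose the data consists of two different patterns, one linear and one quadratic. A single classifier may then overfit or give inaccurate results. You can read this tutorial to learn more about ensembling, bagging, and boosting.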
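To make the majority-vote claim concrete, here is a minimal sketch that computes how often a majority of voters is right. It assumes a binary task, an odd number of voters, and that the voters make their mistakes independently of each other, which real models trained on the same data rarely do. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that the majority of n voters is correct.

    Assumptions (illustrative only): binary labels, n is odd, and every
    voter is correct with probability p independently of the others.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    # Sum the binomial probabilities of having more than n/2 correct voters.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Each voter is only slightly better than chance (60% accurate).
for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the accuracy grows with the number of voters. If the voters' errors are strongly correlated, the gain shrinks, which is why ensembles try to make the base models differ from each other.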

1) "But I still cannot get the point, why not train all training data can avoid the problem?" - We hold that data out for validation purposes, just as we do in K-fold.

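2) "so should I also train M base-learner based on all train data for the input to predict?" - If you feed exactly the same data to all learners (and they are the same kind of model), all of them will produce the same output, and there is no use in creating them. So we give each learner a subset of the data.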
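The hold-out idea from point 1 can be sketched with scikit-learn. This is only an illustration under assumptions: the synthetic dataset, the two base learners, and the logistic-regression meta-learner are arbitrary choices, not taken from the article. `cross_val_predict` gives, for every training point, a prediction from a model that did not see that point, and the meta-learner is trained on those out-of-fold predictions. For prediction, this sketch refits each base learner on the full training set, which is one common convention (scikit-learn's `StackingClassifier` works this way); giving each learner its own subset, as described in point 2, is another option.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# M = 2 base learners (arbitrary choices for illustration).
base_learners = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# Out-of-fold predictions: each row is predicted by a model that was
# trained on the other folds, so it never saw that row.
oof_features = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5,
                      method="predict_proba")[:, 1]
    for model in base_learners
])

# The meta-learner is trained on the out-of-fold predictions only.
meta_learner = LogisticRegression().fit(oof_features, y_train)

# For new inputs: refit the base learners on all training data, pass the
# input through them, then send the M predictions to the meta-learner.
fitted = [model.fit(X_train, y_train) for model in base_learners]
test_features = np.column_stack(
    [model.predict_proba(X_test)[:, 1] for model in fitted]
)
test_pred = meta_learner.predict(test_features)
print("stacked accuracy:", (test_pred == y_test).mean())
```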

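Suppose red and blue are the best models you can find.

One works better in region 1, the other works better in region 2.

Now you can also train a classifier that predicts which model to use, i.e. you try to learn the two regions.

Do the validation on the outside. If you give the two inner models access to data that the meta-model does not see, the ensemble may overfit.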
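The "learn which model to use" idea can be sketched in plain Python. Everything here is hypothetical and chosen for illustration: `red_model` and `blue_model` are fixed functions standing in for two trained models, the target is linear for negative inputs and quadratic otherwise, and the "meta-model" is just a single threshold learned on a separate gating set.

```python
import random


def red_model(x):
    # Hypothetical base model: accurate in region 1 (x < 0).
    return x


def blue_model(x):
    # Hypothetical base model: accurate in region 2 (x >= 0).
    return x * x


def true_target(x):
    # Assumed ground truth: linear on the left, quadratic on the right.
    return x if x < 0 else x * x


def combined_model(x, threshold):
    # The gate picks the red model left of the threshold, blue otherwise.
    return red_model(x) if x < threshold else blue_model(x)


def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_gate(xs, ys):
    """Pick the threshold that minimizes the squared error on (xs, ys)."""
    best_t, best_err = None, float("inf")
    for t in sorted(xs):
        err = mse(lambda x: combined_model(x, t), xs, ys)
        if err < best_err:
            best_t, best_err = t, err
    return best_t


rng = random.Random(0)
gate_xs = [rng.uniform(-2, 2) for _ in range(200)]  # data for the gate only
test_xs = [rng.uniform(-2, 2) for _ in range(200)]  # untouched until the end
gate_ys = [true_target(x) for x in gate_xs]
test_ys = [true_target(x) for x in test_xs]

threshold = fit_gate(gate_xs, gate_ys)
red_err = mse(red_model, test_xs, test_ys)
blue_err = mse(blue_model, test_xs, test_ys)
combined_err = mse(lambda x: combined_model(x, threshold), test_xs, test_ys)
print(threshold, red_err, blue_err, combined_err)
```

The final error is measured on `test_xs`, which neither the gate nor the base models used. That is the "validation on the outside" from the answer above.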

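For question 1, I will explain why we train the two models in contrasting ways. Suppose you train one model on all the data points. During training, whenever the model sees a data point belonging to the red class, it tries to adapt itself so that it can classify red points with minimal error. The same is true for data points belonging to the blue class. So during training the model is pulled toward whichever kind of data point it is looking at (red or blue). In the end the model fits itself so that it does not make too many mistakes on any of the data points, and the final model is an averaged model. But if you train two models on two different datasets, each model is trained on its own dataset and does not have to care about the data points belonging to the other class.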

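The following analogy makes this clearer. Suppose there are two people who specialize in two completely different jobs. Now a job comes in, and you tell them that both of them have to work on it, each doing 50% of the work. Think about what kind of result you would end up with. Then think about what the result would be if you told each person to work only on the job they are best at.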

For question 2, you have to split the training dataset into M datasets, and during training give those M datasets to the M base learners.
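Splitting the training set into M parts, as described above, can be sketched like this. The helper name `split_into_subsets` and the toy data are made up for illustration. The sketch produces disjoint subsets; bagging would instead draw each subset by sampling with replacement.

```python
import random


def split_into_subsets(samples, m, seed=0):
    """Shuffle the samples and deal them into m disjoint subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Subset i takes every m-th element starting at position i.
    return [shuffled[i::m] for i in range(m)]


train_data = list(range(10))  # stand-in for real training rows
subsets = split_into_subsets(train_data, m=3)
for i, subset in enumerate(subsets):
    print(f"base learner {i} trains on {len(subset)} samples")
```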

python

machine-learning

data-mining

scikit-learn

ensemble-learning