A simple explanation of Random Forest

I'm trying to understand how Random Forest works in plain English instead of mathematics. Can anybody give me a really simple explanation of how this algorithm works?

As far as I understand, we feed in the features and the labels without telling the algorithm which feature should be classified under which label? When I used to do Naive Bayes, which is based on probability, we needed to say which feature should map to which label. Am I completely far off?

I'd be really grateful for any very simple explanation.

RandomForest uses a so-called bagging approach. The idea is based on the classic bias-variance trade-off. Suppose we have a set of, say, N overfitted estimators that have low bias but high cross-sample variance. Low bias is good and we want to keep it; high variance is bad and we want to reduce it. RandomForest tries to achieve this through so-called bootstrapping/sub-sampling (as @Alexander mentioned, this is a combination of bootstrap sampling on both observations and features). The prediction is the average of the individual estimators, so the low-bias property is successfully preserved. Furthermore, by elementary variance algebra (the same fact that underlies the central limit theorem), the variance of this sample average equals the variance of an individual estimator divided by N, assuming the estimators are roughly independent: Var(average) = Var(single)/N. So now it has both low bias and low variance, which is why RandomForest often outperforms a stand-alone estimator.
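A minimal numpy sketch of that variance claim, under the idealised assumption that the N estimators are unbiased and independent (all numbers are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100                       # number of estimators in the ensemble
    true_value, sigma = 1.0, 2.0  # each estimator is unbiased with variance sigma**2

    # Simulate 10,000 ensembles: each row holds N independent estimator outputs.
    estimates = rng.normal(true_value, sigma, size=(10_000, N))

    print("variance of a single estimator:", estimates[:, 0].var())        # ~ sigma**2 = 4
    print("variance of the ensemble mean :", estimates.mean(axis=1).var()) # ~ sigma**2 / N = 0.04

Averaging keeps the centre of the predictions where it was (low bias) while dividing their spread by N, which is the whole point of bagging.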

I will try to complement this with another explanation in simple words.

A random forest is a collection of random decision trees (their number is set by n_estimators in sklearn). What you need to understand is how to build one random decision tree.

Roughly speaking, to build a random decision tree you start from a subset of your training samples. At each node you draw a random subset of the features (the number is determined by max_features in sklearn). For each of these features you test different thresholds and see how well they split your samples according to a given criterion (typically entropy or Gini impurity, the criterion parameter in sklearn). You then keep the feature and the threshold that best split your data, and record them in the node. When the construction of the tree ends (which can happen for different reasons: the maximum depth is reached (max_depth in sklearn), the minimum number of samples is reached (min_samples_leaf in sklearn), etc.), you look at the samples that fall into each leaf and keep the frequencies of their labels. As a result, it is as if the tree gives you a partition of your training samples according to meaningful features.
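A short sketch wiring the sklearn parameters named above into an actual classifier; the dataset and the particular values are only illustrative:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Toy data, only so the snippet is self-contained.
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    clf = RandomForestClassifier(
        n_estimators=100,     # number of random decision trees in the forest
        max_features="sqrt",  # size of the random feature subset drawn at each node
        criterion="gini",     # split quality measure ("entropy" is the other common choice)
        max_depth=None,       # grow each tree until the stopping rules apply
        min_samples_leaf=1,   # minimum number of samples kept in a leaf
        random_state=0,
    ).fit(X, y)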

Since each node is built from features taken at random, you can see that every tree built in this way will be different. This contributes to the good trade-off between bias and variance, as @Jianxun Li explained.
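You can check this on a fitted forest: in sklearn the individual trees are exposed as estimators_, and they end up with different shapes and different root splits. A quick sketch on toy data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    clf = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)

    # Each tree was grown on a bootstrap sample with random feature subsets,
    # so node counts and root splits differ from tree to tree.
    for i, tree in enumerate(clf.estimators_):
        print(f"tree {i}: {tree.tree_.node_count} nodes, "
              f"root split on feature {tree.tree_.feature[0]}")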

Then, in testing mode, a test sample goes through each tree, which gives you label frequencies for every tree. The most represented label is generally the final classification result.
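This is also how sklearn behaves at prediction time: predict_proba reports the per-class frequencies averaged over the trees, and predict takes the label with the highest averaged frequency. A sketch, again on toy data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

    x = X[:1]  # one test sample
    print("per-tree votes  :", [t.predict(x)[0] for t in clf.estimators_])
    print("avg label freq  :", clf.predict_proba(x))  # frequencies averaged over trees
    print("final prediction:", clf.predict(x))        # label with the highest average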

To add to the two answers above, since you asked for a simple explanation: here is a write-up that I feel is the simplest way of explaining random forests.

Credits to Edwin Chen for this simple explanation of random forests in layman's terms, here. Posting the same below.

Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you’ll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like “Is X a romantic movie?”, “Does Johnny Depp star in X?”, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.

Thus, Willow is a decision tree for your movie preferences.

But Willow is only human, so she doesn’t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you’d like to ask a bunch of your friends and watch movie X if most of them say they think you’ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you’ll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).

Now you don’t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you’re not absolutely sure of your preferences yourself – you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn’t use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you really really loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don’t change your love/hate decisions, you just say you love/hate some movies a little more or less (formally, you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you tell Woody that you liked Black Swan so much you watched it twice, you disliked Avatar, and don’t mention Harry Potter at all.

By using this ensemble, you hope that while each of your friends gives somewhat idiosyncratic recommendations (Willow thinks you like vampire movies more than you do, Woody thinks you like Pixar movies, and Cartman thinks you just hate everything), the errors get canceled out in the majority. Thus, your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.

There’s still one problem with your data, however. While you loved both Titanic and Inception, it wasn’t because you like movies that star Leonardo DiCaprio. Maybe you liked both movies for other reasons. Thus, you don’t want your friends to all base their recommendations on whether Leo is in a movie or not. So when each friend asks IMDB a question, only a random subset of the possible questions is allowed (i.e., when you’re building a decision tree, at each node you use some randomness in selecting the attribute to split on, say by randomly selecting an attribute or by selecting an attribute from a random subset). This means your friends aren’t allowed to ask whether Leonardo DiCaprio is in the movie whenever they want. So whereas previously you injected randomness at the data level, by perturbing your movie preferences slightly, now you’re injecting randomness at the model level, by making your friends ask different questions at different times.

And so your friends now form a random forest.