Is there a parallel implementation of GBM in R?
I am using the gbm library in R, and I would like to fit a model using all of my CPUs.
gbm.fit(x, y,
        offset = NULL,
        misc = NULL, ...
Well, no: a parallel implementation of GBM is not possible in principle, neither in R nor in any other package. The reason is simple: by definition, the boosting algorithm is sequential.
Consider the following, quoted from The Elements of Statistical Learning, Ch. 10 (Boosting and Additive Trees), pp. 337-339 (emphasis mine):
A weak classifier is one whose error rate is only slightly better than
random guessing. The purpose of boosting is to sequentially apply the
weak classification algorithm to repeatedly modified versions of the data,
thereby producing a sequence of weak classifiers Gm(x), m = 1, 2, . . . , M. The predictions from all of them are then combined through a weighted
majority vote to produce the final prediction.
[...]
Each successive classifier is thereby forced to concentrate on those training observations that are missed by previous ones in the sequence.
Pictorially (ibid., p. 338; figure not reproduced here):
In fact, this is frequently mentioned as a major disadvantage of GBM relative to random forests (RF), where the trees are independent and can therefore be fitted in parallel (see the bigrf R package).
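To make that sequential dependence concrete, here is a toy least-squares boosting loop (my own illustration, not the gbm package's actual code; it assumes x is a data frame of predictors and y a numeric response). Each tree is fit to the residuals left by all previous trees, so iteration m cannot start before iteration m-1 has finished:

library(rpart)

boost_ls <- function(x, y, M = 100, shrinkage = 0.1) {
  f <- rep(mean(y), length(y))          # start from a constant fit
  trees <- vector("list", M)
  for (m in seq_len(M)) {               # strictly sequential outer loop
    r <- y - f                          # residuals = negative gradient for squared loss
    trees[[m]] <- rpart(r ~ ., data = data.frame(x, r), maxdepth = 2)
    f <- f + shrinkage * predict(trees[[m]], newdata = data.frame(x))
  }
  list(init = mean(y), trees = trees, shrinkage = shrinkage)
}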
So, as the commenters above have pointed out, the best you can do is use your spare CPU cores to parallelize the cross-validation procedure...
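For example (a minimal sketch, assuming the gbm() formula interface and that x, y from the question are a predictor data frame and a numeric response): the n.cores argument only spreads the cross-validation folds over cores; the boosting iterations within each fold still run one after another.

library(gbm)

fit <- gbm(y ~ ., data = data.frame(x, y = y),
           distribution = "gaussian",
           n.trees = 2000,
           interaction.depth = 3,
           shrinkage = 0.01,
           cv.folds = 5,                # 5-fold cross-validation
           n.cores = 5)                 # one core per fold

best_iter <- gbm.perf(fit, method = "cv")   # choose n.trees by CV error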
As for h2o, see e.g. this blog post of theirs from 2013, from which I quote:
At 0xdata we build state-of-the-art distributed algorithms - and recently we embarked on building GBM, an algorithm notorious for being impossible to parallelize much less distribute. We built the algorithm shown in Elements of Statistical Learning II, Trevor Hastie, Robert Tibshirani, and Jerome Friedman on page 387 (shown at the bottom of this post). Most of the algorithm is straightforward “small” math, but step 2.b.ii says “Fit a regression tree to the targets….”, i.e. fit a regression tree in the middle of the inner loop, for targets that change with each outer loop. This is where we decided to distribute/parallelize.
The platform we build on is H2O, and as talked about in the prior blog has an API focused on doing big parallel vector operations - and for GBM (and also Random Forest) we need to do big parallel tree operations. But not really any tree operation; GBM (and RF) constantly build trees - and the work is always at the leaves of a tree, and is about finding the next best split point for the subset of training data that falls into a particular leaf.
The code can be found on our git: http://0xdata.github.io/h2o/
(Edit: the repo now lives at https://github.com/h2oai/.)
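For completeness, a hedged sketch of what the h2o route looks like from R (argument names as in the h2o package's h2o.gbm(), which may differ between versions; x and y are again the question's data). Here the parallelism happens inside each boosting iteration, when the next tree is grown on the distributed backend:

library(h2o)
h2o.init(nthreads = -1)                     # use all available cores

train <- as.h2o(data.frame(x, y = y))       # move the data into the H2O cluster
fit <- h2o.gbm(x = setdiff(colnames(train), "y"),
               y = "y",
               training_frame = train,
               ntrees = 500,
               max_depth = 5,
               learn_rate = 0.05)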
I believe another parallel GBM implementation is in xgboost. Its description says:
Extreme Gradient Boosting, which is an efficient implementation of gradient boosting framework. This package is its R interface. The package includes efficient linear model solver and tree learning algorithms. The package can automatically do parallel computation on a single machine which could be more than 10 times faster than existing gradient boosting packages. It supports various objective functions, including regression, classification and ranking. The package is made to be extensible, so that users are also allowed to define their own objectives easily.
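Again as a hedged sketch (assuming xgboost's plain xgboost() interface; check your installed version's docs for the exact objective names): nthread parallelizes the split search within each tree, while the boosting rounds themselves remain sequential.

library(xgboost)

fit <- xgboost(data = as.matrix(x), label = y,
               nrounds = 500,
               max_depth = 3,
               eta = 0.05,
               nthread = parallel::detectCores(),   # within-tree parallelism
               objective = "reg:squarederror")      # "reg:linear" in older versions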