Different values by fitting a boosted tree twice

Question

I use the R package adabag to fit boosted trees to a (large) data set (140 observations and 3,845 predictors).

I ran this method twice with the same parameters and the same data set, and each run returned a different accuracy (I defined a simple function that returns the accuracy for a given data set). Did I make a mistake, or is it normal for each fit to return a different accuracy? Is this caused by the data set being large?

The function that returns the accuracy, given the predicted values and the true test-set values:

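    # Returns c(absolute accuracy, relative accuracy) for predictions pred_d
    # compared against the true test-set values test_d.
    err <- function(pred_d, test_d) {
      abs.acc <- sum(pred_d == test_d)      # number of correct predictions
      rel.acc <- abs.acc / length(test_d)   # fraction of correct predictions

      v <- c(abs.acc, rel.acc)

      return(v)
    }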

New edit (9.1.2017): an important follow-up question to the context above.

As far as I can see, I do not use any "pseudo-randomness objects" (such as random number generation) in my code, because I essentially fit trees (using the R package rpart) and boosted trees (using the R package adabag) to a large data set. Can you explain where "pseudo-randomness" enters when I run my code?

Edit 1: A similar phenomenon also occurs with trees (using the R package rpart).

Edit 2: The phenomenon does not occur with trees (using rpart) on the iris data set.

Answer 1

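If you do not set a seed (set.seed()), you have no reason to expect the same results.

If you are doing statistics rather than information security, it does not matter which seed you set. You might run your model with several different seeds to check how sensitive it is. You only need to set the seed before anything that involves pseudo-randomness. Most people set it at the beginning of their code.

This is ubiquitous in statistics; it affects all probabilistic models and procedures in every language.

Note that in information security it is important to have a (pseudo)random seed that cannot easily be guessed by a brute-force attack, because (in short) knowing the seed value used internally by a security program paves the way for it to be hacked. In science and statistics it is exactly the opposite: you and anyone you share your code/research with should know the seed, to ensure reproducibility.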
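A minimal sketch of what set.seed() does, using only base R. The random train/test split is an assumption about where pseudo-randomness might enter code like the asker's (the full code is not shown); as far as I know, adabag's boosting() also resamples the training data by default (boos = TRUE). The seed value 42 and the subset size 100 are arbitrary choices for illustration.

```r
# Assumed scenario: a random subset of the 140 observations is drawn for training.
n <- 140

set.seed(42)                      # fix the state of the random number generator
idx_a <- sample(n, size = 100)    # random training indices

set.seed(42)                      # resetting the same seed ...
idx_b <- sample(n, size = 100)    # ... reproduces exactly the same indices

idx_c <- sample(n, size = 100)    # no reset: the generator has moved on,
                                  # so this draw will generally differ

print(identical(idx_a, idx_b))    # TRUE: same seed, same draw
print(identical(idx_a, idx_c))
```

Anything that depends on such a draw (the fitted model, the predictions, the accuracy) changes from run to run unless the seed is fixed first.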
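The sensitivity check mentioned above could look roughly like the sketch below. It uses rpart on the built-in iris data with a random train/test split; the helper accuracy_for_seed(), the split size of 100 and the seeds 1 to 5 are hypothetical choices for illustration, not anything taken from the question.

```r
library(rpart)  # recommended package, shipped with standard R installations

# Hypothetical helper: split iris at random, fit a classification tree,
# and return the relative accuracy on the held-out rows.
accuracy_for_seed <- function(seed) {
  set.seed(seed)                                  # set the seed before any random step
  train_idx <- sample(nrow(iris), size = 100)     # random train/test split
  fit  <- rpart(Species ~ ., data = iris[train_idx, ], method = "class")
  pred <- predict(fit, newdata = iris[-train_idx, ], type = "class")
  mean(pred == iris$Species[-train_idx])          # fraction predicted correctly
}

seeds <- 1:5                                      # arbitrary seed values
acc   <- sapply(seeds, accuracy_for_seed)
print(data.frame(seed = seeds, accuracy = acc))

# Calling the helper again with the same seed gives the same accuracy.
print(identical(accuracy_for_seed(1), accuracy_for_seed(1)))
```

If the accuracies vary a lot across seeds, the result depends strongly on the particular split, which is plausible with only 140 observations.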

https://en.wikipedia.org/wiki/Random_seed

http://www.grasshopper3d.com/forum/topics/what-are-random-seed-values

r

machine-learning

boosting