Effect of weights column in h2o DRF

Question

The h2o documentation says the following about the weights_column option:

This option specifies the column in a training frame to be used when determining weights. Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are also supported. During training, rows with higher weights matter more, due to the larger loss function pre-factor.

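I'm especially interested in the effect of the weights column on DRF (random forest) regression trees. I find the description "This is typically the number of times a row is repeated" confusing. Although they say that the size of the frame is not actually increased, it suggests that rows with higher/lower weights get over/under-sampled (when the per-tree training data is selected according to sample_rate). However, looking at the h2o source code on github, this does not seem to be the case. The relevant part of the code where weights are used is in DHistogram.java and reads: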

double wy = weight * y;
double wyy = wy * y;  // This is the correct implementation.
int b = bin(col_data);
_vals[3*b + 0] += weight;
_vals[3*b + 1] += wy;
_vals[3*b + 2] += wyy;

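This suggests that weights are only used for calculating the weighted number of rows (_vals[3*b + 0]) and the weighted sum of squared errors (via _vals[3*b + 1] and _vals[3*b + 2], see DTree.java).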
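To make the role of the three accumulators concrete, here is a minimal Python sketch. It is not h2o code: the function names `accumulate` and `weighted_se` are made up for illustration, and it assumes the standard identity SE = Σwy² − (Σwy)²/Σw for the weighted squared error around the weighted mean. It shows that the three per-bin sums are enough to recover that error without revisiting the rows.

```python
def accumulate(rows):
    """Mimic the three per-bin sums: sum of w, sum of w*y, sum of w*y*y.

    rows is an iterable of (weight, y) pairs that fall into one bin.
    """
    sw = swy = swyy = 0.0
    for weight, y in rows:
        wy = weight * y
        sw += weight      # weighted row count
        swy += wy         # weighted sum of responses
        swyy += wy * y    # weighted sum of squared responses
    return sw, swy, swyy


def weighted_se(sw, swy, swyy):
    """Weighted squared error around the weighted mean, from the sums alone."""
    return swyy - swy * swy / sw


rows = [(1.0, 2.0), (2.0, 4.0), (1.0, 6.0)]
sw, swy, swyy = accumulate(rows)
mean = swy / sw
# Direct computation over the rows, for comparison with the shortcut.
direct = sum(w * (y - mean) ** 2 for w, y in rows)
print(sw, mean, weighted_se(sw, swy, swyy), direct)
```

Both ways of computing the error agree, which is why a histogram only needs to store these three numbers per bin.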

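Furthermore, I ran some tests in R with different weights. I trained several DRF models, each with uniform weights across all observations but with a different weight magnitude per model. My suspicion that the weights are only used for the weighted row counts and the weighted squared error seems to be confirmed.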

library(h2o)

h2o.init()

#different weights for each model
iris$weight0=0.5
iris$weight1=1
iris$weight2=2
irisH=as.h2o(iris)
predNames=setdiff(colnames(irisH),c("Sepal.Length","weight2","weight1","weight0"))
exludeLinesRegex="(.*DRF_model_R_.*)|(.*AUTOGENERATED.*)|(.*UUID.*)|(.*weight.*)"
pojoList=list()

#train 3 models, each with different weights magnitude
for (i in 0:2) {
    weightColName=paste0("weight",i)
    tmpRf=h2o.randomForest(y="Sepal.Length",
                           x=predNames,
                           training_frame = irisH,
                           seed = 1234,
                           ntrees = 10,
                           #min_rows has to be adjusted-it refers to weighted rows
                           min_rows= 20*irisH[1,weightColName],
                           max_depth = 3,
                           mtries = 4,
                           weights_column = weightColName)
    tmpPojo=capture.output(h2o.download_pojo(tmpRf))
    pojoList[[length(pojoList)+1]]=tmpPojo[!grepl(exludeLinesRegex,tmpPojo)]
}

h2o.shutdown(FALSE)

# all forests are the same
length(unique(pojoList))
# 1

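As seen above, all 3 forests are identical despite the different magnitudes of the weights. The only adjustment needed is to min_rows, because it refers to the weighted number of rows. If rows were really over/under-sampled, I would expect to see (small) differences between the models.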

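So my questions are:

Are weights used anywhere other than in calculating the weighted number of rows and the sum of squared errors?
Are regression DRF models generally invariant under a uniform scaling of the weights, i.e. if I multiply the weights column by a scalar a>0 and adjust min_rows accordingly, does the model stay the same? (As shown in the R code example above.)
If yes, does this also hold for forests with classification trees and for GBM models?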

Thanks for your help!

Answer 1

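Conceptually, weights can indicate either which rows are important to get right, or which rows to replicate or compress. Including a weights_column, however, does not change the size of the actual dataset; it only affects the mathematical calculations of DRF.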
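The "replicate or compress" reading can be checked on the split statistics alone. The following Python sketch (illustrative only; `bin_sums` is a hypothetical helper, not an h2o function) compares one bin built from integer-weighted rows with the same bin built from physically repeated rows of weight 1. It covers only the accumulated sums; it says nothing about per-tree row sampling, which according to this answer is not affected by the weights.

```python
def bin_sums(rows):
    """Return (sum w, sum w*y, sum w*y*y) for (weight, y) pairs in one bin."""
    sw = swy = swyy = 0.0
    for w, y in rows:
        sw += w
        swy += w * y
        swyy += w * y * y
    return sw, swy, swyy


# Three distinct rows carrying integer weights ...
weighted = [(3, 1.5), (1, 4.0), (2, -2.0)]
# ... versus the same rows repeated 'weight' times, each with weight 1.
replicated = [(1, y) for w, y in weighted for _ in range(w)]

print(len(weighted), len(replicated))
print(bin_sums(weighted))
print(bin_sums(replicated))
```

The weighted frame stays at three rows while the replicated one has six, yet the sums that drive the split decision are the same.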

Note: H2O trains regression trees regardless of whether you are solving a classification or a regression problem.

Bulleted details

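weights_column does not affect the sampling rate.
Multiplying the weights by a factor does not change the result (i.e. a weights column of all 1s is the same as a weights column of all 2s). In the question's R experiment, min_rows was scaled by the same factor, since it refers to weighted rows.
Weights are used in several places; here are a few examples:
- The first node in each tree.
- The loss function used to decide which feature to split on at each internal node.
- The terminal nodes.
- All performance metric calculations.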
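The scaling invariance can be illustrated with a toy split finder. This is a sketch under stated assumptions, not h2o's algorithm: it scans exact thresholds on a single feature instead of histogram bins, and `best_split` and its parameters are hypothetical names. Multiplying every weight by a > 0 multiplies every candidate's squared error by a, so the ranking of candidates is unchanged as long as the weighted `min_rows` limit is scaled too.

```python
def best_split(xs, ys, ws, min_rows):
    """Return the threshold minimising total weighted squared error.

    A split is allowed only if both children hold at least min_rows
    of weighted rows. Returns None if no split qualifies.
    """
    def stats(idx):
        sw = sum(ws[i] for i in idx)
        swy = sum(ws[i] * ys[i] for i in idx)
        swyy = sum(ws[i] * ys[i] ** 2 for i in idx)
        return sw, swyy - swy * swy / sw

    best = None
    for t in sorted(set(xs))[1:]:
        left = [i for i, x in enumerate(xs) if x < t]
        right = [i for i, x in enumerate(xs) if x >= t]
        lw, lse = stats(left)
        rw, rse = stats(right)
        if lw < min_rows or rw < min_rows:
            continue  # a child would be too small in weighted rows
        if best is None or lse + rse < best[1]:
            best = (t, lse + rse)
    return None if best is None else best[0]


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

# Same data, three weight magnitudes, min_rows scaled along with the weights.
for a in (0.5, 1.0, 2.0):
    print(a, best_split(xs, ys, [a] * len(xs), min_rows=2 * a))

# Without rescaling min_rows, halved weights can block every split.
print(best_split(xs, ys, [0.5] * len(xs), min_rows=2))
```

The last call shows why the question's R code had to adjust min_rows: with the limit left unscaled, no child reaches the required weighted row count and no split is made.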
