Have RMSE in Random Survival Forest in R program

Question

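I need to use RMSE to compare three models with each other and decide which is better. The models I have to run are a survival decision tree, a random survival forest, and bagging. I have run and tuned my models, but in the end all I have are some predictions. The random survival forest result is shown below. What should I do to get an RMSE?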

library(survival)
library(randomForestSRC)

dataset <- data.frame(data)
dataset

n.sample = round(0.5 * nrow(dataset))
dataset1 = sample(1:nrow(dataset), n.sample)
train = data[dataset1, ]
test = data[-dataset1, ]

set.seed(1369)
rsf0 = rfsrc(Surv(time, status) ~ ., train, importance = TRUE, forest = T, ensemble = "oob", mtry = NULL, block.size = 1, splitrule = "logrank")
print(rsf0)

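Result:
Sample size: 821
Number of deaths: 209
Number of trees: 1000
Forest terminal node size: 15
Average no. of terminal nodes: 38.62
No. of variables tried at each split: 4
Total no. of variables: 14
Resampling used to grow trees: swor
Resample size used to grow trees: 519
Analysis: RSF
Family: surv
Splitting rule: logrank *random*
Number of random split points: 10
Error rate: 36.15%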

Answer 1

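I think you have slightly misunderstood what survival analysis models are usually used for. Usually we want to predict the distribution of the survival time, not the survival time itself. RMSE can only be used when an actual survival time is predicted. The models in your example make distribution predictions.

So first I have cleaned up your code slightly and added an example dataset to make it reproducible: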

library(survival)
library(randomForestSRC)

# use the rats dataset to make the example reproducible
dataset <- data.frame(survival::rats)
dataset$sex <- factor(dataset$sex)

# note that you need to set.seed before you use `sample`
set.seed(1369)

# again specifying train/test split but this time as two separate sets of integers
train = sample(nrow(dataset), 0.5 * nrow(dataset))
test = setdiff(seq(nrow(dataset)), train)

# train the random forest model on the training data
rsf0 = rfsrc(Surv(time,status)~., dataset[train, ], importance=TRUE, forest=T, 
ensemble="oob", mtry=NULL, block.size=1, splitrule="logrank")

# now make predictions
predictions = predict(rsf0, newdata = dataset[-train, ])

# view the predicted survival probabilities
predictions$survival

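With these probabilities you then have to decide how to convert them into a survival time prediction, and you have to calculate the RMSE manually after first removing all censored observations. Common transformations to a survival time are taking the mean or the median of the predicted individual distribution.

As an alternative, and inserting a plug for my own package here, you can use {mlr3proba} to do this for you: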
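The manual route described above might look roughly like the sketch below. It uses the median as the conversion and only base R, so it runs on a small made-up survival matrix. The helper names `median_survival_time` and `rmse_uncensored` are hypothetical, and falling back to the last time point when a curve never reaches 0.5 is an assumption, not something the packages prescribe.

```r
# Median survival time for each row of a survival-probability matrix.
# surv : n x m matrix, row i holds S_i(t) evaluated at `times`
# times: increasing vector of the m time points
median_survival_time <- function(surv, times) {
  apply(surv, 1, function(s) {
    idx <- which(s <= 0.5)
    # Assumption: if the curve never drops to 0.5, use the last time point
    if (length(idx) == 0) max(times) else times[min(idx)]
  })
}

# RMSE computed on uncensored observations only (status == 1 means event)
rmse_uncensored <- function(pred_time, obs_time, status) {
  dead <- status == 1
  sqrt(mean((pred_time[dead] - obs_time[dead])^2))
}

# Toy data (made up for illustration): three subjects, three time points
times <- c(10, 20, 30)
surv <- rbind(c(0.90, 0.40, 0.10),
              c(0.80, 0.60, 0.50),
              c(0.95, 0.90, 0.70))
obs_time <- c(25, 30, 40)
status   <- c(1, 1, 0)   # third subject is censored and is dropped

pred_time <- median_survival_time(surv, times)
rmse <- rmse_uncensored(pred_time, obs_time, status)
print(pred_time)
print(rmse)
```

With the forest above, the same helpers could presumably be called as `median_survival_time(predictions$survival, predictions$time.interest)`, assuming the prediction object exposes the evaluation times as `time.interest`, and then compared against `dataset$time[test]` and `dataset$status[test]`.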

# load required packages
library(mlr3); library(mlr3proba);library(mlr3extralearners); library(mlr3pipelines)

# use the rats dataset to make the example reproducible
dataset <- data.frame(survival::rats)
dataset$sex <- factor(dataset$sex)

# note that you need to set.seed before you use `sample`
set.seed(1369)

# again specifying train/test split but this time as two separate sets of integers
train = sample(nrow(dataset), 0.5 * nrow(dataset))
test = setdiff(seq(nrow(dataset)), train)

# select the random forest model and use the `crankcompositor` to automatically
# create survival time predictions
learn = ppl("crankcompositor", lrn("surv.rfsrc"), response = TRUE, graph_learner = TRUE)

# create a task which stores your dataset
task = TaskSurv$new("data", backend = dataset, time = "time", event = "status")

# train your learner on training data
learn$train(task, row_ids = train)

# make predictions on test data
predictions = learn$predict(task, row_ids = test)

# view your survival time predictions
predictions$response

# calculate RMSE
predictions$score(msr("surv.rmse"))

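The second option is more complicated if you are not used to R6, but I suspect it will be helpful in your use case, as you can also compare multiple models at the same time.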
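Whichever route produces the survival time predictions, the final comparison comes down to scoring each model on the same test rows. A minimal base R sketch is below. The predicted times are made-up placeholder numbers, not output from any fitted model, and the helper `rmse_uncensored` is a hypothetical name.

```r
# RMSE computed on uncensored observations only (status == 1 means event)
rmse_uncensored <- function(pred_time, obs_time, status) {
  dead <- status == 1
  sqrt(mean((pred_time[dead] - obs_time[dead])^2))
}

# Observed test outcomes (placeholder values for illustration)
obs_time <- c(100, 80, 60, 90)
status   <- c(1, 1, 0, 1)

# Placeholder survival time predictions from the three models,
# all made on the same test rows
preds <- list(
  tree    = c(90, 85, 70, 100),
  forest  = c(98, 78, 65, 93),
  bagging = c(95, 90, 50, 85)
)

# One RMSE per model; the smallest value indicates the best fit
rmse_by_model <- sapply(preds, rmse_uncensored,
                        obs_time = obs_time, status = status)
best_model <- names(which.min(rmse_by_model))
print(rmse_by_model)
print(best_model)
```

Using one fixed train/test split for all three models keeps the comparison fair, since every RMSE is then computed on the same uncensored observations.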


random-forest

survival-analysis