理解 gbm 生存预测模型
Making sense of gbm survival prediction model
我是使用和理解 ML 方法的新手,目前使用 R 中的 gbm
包进行生存分析。
我很难理解生存预测模型的一些输出。我已经检查了 this tutorial and this post,但仍然无法理解输出的生存预测模型。
这是我根据示例数据进行分析的代码:
rm(list=ls(all=TRUE))
library(randomForestSRC)
library(gbm)
library(survival)
library(Hmisc)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
set.seed(9512)
train <- sample(1:nrow(data), round(nrow(data)*0.7))
data.train <- data[train, ]
data.test <- data[-train, ]
set.seed(9741)
model <- gbm(Surv(days, status)~.,
data.train,
interaction.depth=2,
shrinkage=0.01,
n.trees=500,
distribution="coxph",
cv.folds = 5)
summary(model)
best.iter <- gbm.perf(model, plot.it = TRUE, method = 'cv',
overlay = TRUE) #to get the optimal number of Boosting iterations
best.iter
#Us the best number of tree to produce predicted values for each observation in newdata
# return a vector of prediction on n.trees indicting log hazard scale.f(x)
# By default the predictions are on log hazard scale for coxph
# proportional hazard model assumes h(t|x)=lambda(t)*exp(f(x)).
# estimate the f(x) component of the hazard function
pred.train <- predict(object=model, newdata=data.train, n.trees = best.iter)
pred.test <- predict(object=model, newdata=data.test, n.trees = best.iter)
#trainig set
Hmisc::rcorr.cens(-pred.train, Surv(data.train$days, data.train$status))
#val set
Hmisc::rcorr.cens(-pred.test, Surv(data.test$days, data.test$status))
# Estimate the cumulative baseline hazard function using training data
basehaz.cum <- basehaz.gbm(t=data.train$days, #The survival times.
delta=data.train$status, #The censoring indicator
f.x=pred.train, #The predicted values of the regression model on the log hazard scale.
t.eval = data.train$days, #Values at which the baseline hazard will be evaluated
cumulative = TRUE, #If TRUE the cumulative survival function will be computed
smooth = FALSE) #If TRUE basehaz.gbm will smooth the estimated baseline hazard using Friedman's super smoother supsmu.
basehaz.cum
#Estimation of survival rate of all:
surv.rate <- exp(-exp(pred.train)*basehaz.cum)
surv.rate
res_train <- data.train
# predicted outcome for train set
res_train$pred <- pred.train
res_train$survival_rate <- surv.rate
res_train
# Estimate the cumulative baseline hazard function using training data
basehaz.cum <- basehaz.gbm(t=data.test$days, #The survival times.
delta=data.test$status, #The censoring indicator
f.x=pred.test, #The predicted values of the regression model on the log hazard scale.
t.eval = data.test$days, #Values at which the baseline hazard will be evaluated
cumulative = TRUE, #If TRUE the cumulative survival function will be computed
smooth = FALSE) #If TRUE basehaz.gbm will smooth the estimated baseline hazard using Friedman's super smoother supsmu.
basehaz.cum
#Estimation of survival rate of all at specified time is:
surv.rate <- exp(-exp(pred.test)*basehaz.cum)
surv.rate
res_test <- data.test
# predicted outcome for test set
res_test$pred <- pred.test
res_test$survival_rate <- surv.rate
res_test
#--------------------------------------------------
#Estimate survival rate at time of interest
# Specify time of interest
time.interest <- sort(unique(data.train$days[data.train$status==1]))
# Estimate the cumulative baseline hazard function using training data
basehaz.cum <- basehaz.gbm(t=data.train$days, #The survival times.
delta=data.train$status, #The censoring indicator
f.x=pred.train, #The predicted values of the regression model on the log hazard scale.
t.eval = time.interest, #Values at which the baseline hazard will be evaluated
cumulative = TRUE, #If TRUE the cumulative survival function will be computed
smooth = FALSE) #If TRUE basehaz.gbm will smooth the estimated baseline hazard using Friedman's super smoother supsmu.
#For individual $i$ in test set, estimation of survival function is:
surf.i <- exp(-exp(pred.test[1])*basehaz.cum) #survival rate
#Estimation of survival rate of all at specified time is:
specif.time <- time.interest[10]
surv.rate <- exp(-exp(pred.test)*basehaz.cum[10])
cat("Survival Rate of all at time", specif.time, "\n")
print(surv.rate)
predict
函数返回的输出表示风险函数的 f(x)
分量 ( h(t|x)=lambda(t)*exp(f(x)) )。
我的问题:
• 对这里是否可以计算风险比有点困惑?
• 想知道如何将人群分为低风险和高风险人群?我可以依靠风险函数的估计 f(x) 分量来为训练集做评分系统吗?我的目标是建立一个评分系统,在其中显示用于训练和测试集的低风险和高风险组的 KM 图。
• 我如何构建校准曲线图,在其中绘制训练集和测试集的观察到的存活率与预测的存活率?
阿米尔。感谢您阅读我的教程!
正如您提到的“predict
函数返回的输出表示风险函数的 f(x)
分量 (h(t|x)=lambda(t)*exp(f(x))
)”,也许我们需要了解风险函数, 即 h(t|x).
在此之前,请确保您具备生存分析的基本知识。如果没有,建议阅读伟大的post。我认为 post 可以帮助您解决问题。
回到你的问题:
- 确切地说,我们可以通过调用
predict
函数来获得对数尺度的风险比。因此,风险比可以通过 exp()
. 来计算
- 好的!根据风险比的值,我们可以将人群分为 low-risk 和 high-risk 组。或者,您可以使用风险比的中位数作为临界值。我觉得cutoff值应该是从训练集中推导出来的,然后在测试集中进行测试。如果您的模型有效,低组和 high-risk 组的 KM 图将有显着差异(通过 log-rank 统计测试测量)。
- 校准曲线图通常用于评估输出概率或似然范围为 [0.0, 1.0] 的模型的性能。我们可以计算生存函数,然后指定一个感兴趣的时间点,例如5 年。最后,我们将生存概率与指定时间的实际生存状态进行比较,这与我们评估二元分类模型一样。求生存函数的更多细节可以参考我的教程,原理可以参考前面post
我是使用和理解 ML 方法的新手,目前使用 R 中的 gbm
包进行生存分析。
我很难理解生存预测模型的一些输出。我已经检查了 this tutorial and this post,但仍然无法理解输出的生存预测模型。
这是我根据示例数据进行分析的代码:
rm(list=ls(all=TRUE))
library(randomForestSRC)
library(gbm)
library(survival)
library(Hmisc)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
set.seed(9512)
train <- sample(1:nrow(data), round(nrow(data)*0.7))
data.train <- data[train, ]
data.test <- data[-train, ]
set.seed(9741)
model <- gbm(Surv(days, status)~.,
data.train,
interaction.depth=2,
shrinkage=0.01,
n.trees=500,
distribution="coxph",
cv.folds = 5)
summary(model)
best.iter <- gbm.perf(model, plot.it = TRUE, method = 'cv',
overlay = TRUE) #to get the optimal number of Boosting iterations
best.iter
#Us the best number of tree to produce predicted values for each observation in newdata
# return a vector of prediction on n.trees indicting log hazard scale.f(x)
# By default the predictions are on log hazard scale for coxph
# proportional hazard model assumes h(t|x)=lambda(t)*exp(f(x)).
# estimate the f(x) component of the hazard function
pred.train <- predict(object=model, newdata=data.train, n.trees = best.iter)
pred.test <- predict(object=model, newdata=data.test, n.trees = best.iter)
#trainig set
Hmisc::rcorr.cens(-pred.train, Surv(data.train$days, data.train$status))
#val set
Hmisc::rcorr.cens(-pred.test, Surv(data.test$days, data.test$status))
# Estimate the cumulative baseline hazard function using training data
basehaz.cum <- basehaz.gbm(t=data.train$days, #The survival times.
delta=data.train$status, #The censoring indicator
f.x=pred.train, #The predicted values of the regression model on the log hazard scale.
t.eval = data.train$days, #Values at which the baseline hazard will be evaluated
cumulative = TRUE, #If TRUE the cumulative survival function will be computed
smooth = FALSE) #If TRUE basehaz.gbm will smooth the estimated baseline hazard using Friedman's super smoother supsmu.
basehaz.cum
#Estimation of survival rate of all:
surv.rate <- exp(-exp(pred.train)*basehaz.cum)
surv.rate
res_train <- data.train
# predicted outcome for train set
res_train$pred <- pred.train
res_train$survival_rate <- surv.rate
res_train
# Estimate the cumulative baseline hazard function using training data
basehaz.cum <- basehaz.gbm(t=data.test$days, #The survival times.
delta=data.test$status, #The censoring indicator
f.x=pred.test, #The predicted values of the regression model on the log hazard scale.
t.eval = data.test$days, #Values at which the baseline hazard will be evaluated
cumulative = TRUE, #If TRUE the cumulative survival function will be computed
smooth = FALSE) #If TRUE basehaz.gbm will smooth the estimated baseline hazard using Friedman's super smoother supsmu.
basehaz.cum
#Estimation of survival rate of all at specified time is:
surv.rate <- exp(-exp(pred.test)*basehaz.cum)
surv.rate
res_test <- data.test
# predicted outcome for test set
res_test$pred <- pred.test
res_test$survival_rate <- surv.rate
res_test
#--------------------------------------------------
#Estimate survival rate at time of interest
# Specify time of interest
time.interest <- sort(unique(data.train$days[data.train$status==1]))
# Estimate the cumulative baseline hazard function using training data
basehaz.cum <- basehaz.gbm(t=data.train$days, #The survival times.
delta=data.train$status, #The censoring indicator
f.x=pred.train, #The predicted values of the regression model on the log hazard scale.
t.eval = time.interest, #Values at which the baseline hazard will be evaluated
cumulative = TRUE, #If TRUE the cumulative survival function will be computed
smooth = FALSE) #If TRUE basehaz.gbm will smooth the estimated baseline hazard using Friedman's super smoother supsmu.
#For individual $i$ in test set, estimation of survival function is:
surf.i <- exp(-exp(pred.test[1])*basehaz.cum) #survival rate
#Estimation of survival rate of all at specified time is:
specif.time <- time.interest[10]
surv.rate <- exp(-exp(pred.test)*basehaz.cum[10])
cat("Survival Rate of all at time", specif.time, "\n")
print(surv.rate)
predict
函数返回的输出表示风险函数的 f(x)
分量 ( h(t|x)=lambda(t)*exp(f(x)) )。
我的问题:
• 对这里是否可以计算风险比有点困惑?
• 想知道如何将人群分为低风险和高风险人群?我可以依靠风险函数的估计 f(x) 分量来为训练集做评分系统吗?我的目标是建立一个评分系统,在其中显示用于训练和测试集的低风险和高风险组的 KM 图。
• 我如何构建校准曲线图,在其中绘制训练集和测试集的观察到的存活率与预测的存活率?
阿米尔。感谢您阅读我的教程!
正如您提到的“predict
函数返回的输出表示风险函数的 f(x)
分量 (h(t|x)=lambda(t)*exp(f(x))
)”,也许我们需要了解风险函数, 即 h(t|x).
在此之前,请确保您具备生存分析的基本知识。如果没有,建议阅读伟大的post。我认为 post 可以帮助您解决问题。
回到你的问题:
- 确切地说,我们可以通过调用
predict
函数来获得对数尺度的风险比。因此,风险比可以通过exp()
. 来计算
- 好的!根据风险比的值,我们可以将人群分为 low-risk 和 high-risk 组。或者,您可以使用风险比的中位数作为临界值。我觉得cutoff值应该是从训练集中推导出来的,然后在测试集中进行测试。如果您的模型有效,低组和 high-risk 组的 KM 图将有显着差异(通过 log-rank 统计测试测量)。
- 校准曲线图通常用于评估输出概率或似然范围为 [0.0, 1.0] 的模型的性能。我们可以计算生存函数,然后指定一个感兴趣的时间点,例如5 年。最后,我们将生存概率与指定时间的实际生存状态进行比较,这与我们评估二元分类模型一样。求生存函数的更多细节可以参考我的教程,原理可以参考前面post