Computing deviance for conditional inference trees

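I am trying to use conditional inference trees (via the partykit package) as induction trees, whose purpose is purely descriptive rather than the prediction of individual cases. According to Ritschard, for instance, one can compare, by means of cross-tabulations, the real and estimated distributions of the response variable against the possible profiles based on the predictors, the so-called T̂ and T tables. I would like to use the deviance and other derived statistics as goodness-of-fit (GOF) measures for objects obtained from the ctree() function. I am new to this topic and would very much appreciate some input, e.g. a piece of R code or some pointers on the parts of the ctree object structure that might be involved in the coding. I have thought that I could build both the target and predicted tables from scratch myself and then compute the deviance formula, but I confess that I am not at all confident about how to proceed.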

Thanks a lot!

Some background information first: We have discussed adding deviance() or logLik() methods for ctree objects. So far we haven't done so because conditional inference trees are not associated with a particular loss function or even likelihood. Instead, only the associations between response and partitioning variables are assessed by means of conditional inference tests using certain influence and regressor transformations. However, for the default regression and classification case, measures of deviance or log-likelihood can be a useful addition in practice. So maybe we will add these methods in future versions.

If you want to consider trees associated with a formal deviance/likelihood, you may consider using the general mob() framework or the lmtree() and glmtree() convenience functions. If only partitioning variables are specified (and no further regressors to be used in every node), these often lead to very similar trees compared to ctree(). But then you can also use AIC() etc.
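
As a rough illustration of this alternative, the following sketch fits an lmtree() with intercept-only node models to the same air quality data used in example("ctree", package = "partykit"). The object name airlmt is made up for this sketch, and the partitioning variables are written out explicitly. Note that lmtree() drops rows with missing predictor values by default, so the numbers need not agree with those for the ctree() fit.

```r
# Sketch only: an lmtree() with intercept-only node models, as an
# analogue of the airct regression tree from example("ctree").
library("partykit")

# Same data preparation as in example("ctree", package = "partykit")
airq <- subset(airquality, !is.na(Ozone))

# Nothing but the intercept before "|": every node just fits a mean;
# the variables after "|" are used for partitioning only.
# "airlmt" is a hypothetical name chosen for this sketch.
airlmt <- lmtree(Ozone ~ 1 | Solar.R + Wind + Temp + Month + Day, data = airq)

# Likelihood-based summaries are available for model-based trees
deviance(airlmt)
logLik(airlmt)
AIC(airlmt)
```

Because the tree is tied to a Gaussian likelihood, information criteria such as AIC() can be compared across differently sized trees, which is not directly possible for a ctree object.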

But to come back to your original question: You can compute deviance/log-likelihood or other loss functions fairly easily if you look at the model response and the fitted response. Alternatively, you can extract a factor variable that indicates the terminal nodes and refit a linear or multinomial model. This will have the same fitted values but also supply deviance() and logLik(). Below, I illustrate this with the airct and irisct trees that you obtain when running example("ctree", package = "partykit").

Regression: The Gaussian deviance is simply the residual sum of squares:

sum((airq$Ozone - predict(airct, newdata = airq, type = "response"))^2)
## [1] 46825.35

The same can be obtained by re-fitting as a linear regression model:

airq$node <- factor(predict(airct, newdata = airq, type = "node"))
airlm <- lm(Ozone ~ node, data = airq)
deviance(airlm)
## [1] 46825.35
logLik(airlm)
## 'log Lik.' -512.6311 (df=6)
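
For reference, the link between the residual sum of squares and the value reported by logLik() can be written out. This is the standard Gaussian log-likelihood evaluated at the maximum likelihood estimate of the variance, RSS/n. The sample size n = 116 is an assumption here: it is the number of non-missing Ozone values in airquality, which is what the example data contain.

$$
\ell \;=\; -\frac{n}{2}\left[\log(2\pi) + \log\!\left(\frac{\mathrm{RSS}}{n}\right) + 1\right]
$$

Plugging in RSS = 46825.35 and n = 116 gives approximately -512.63, in line with the logLik() output above. The reported df = 6 would then correspond to one mean per terminal node (five nodes) plus the error variance.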

Classification: The log-likelihood is simply the sum of the predicted log-probabilities at the observed class. And the deviance is -2 times the log-likelihood:

irisprob <- predict(irisct, type = "prob")
sum(log(irisprob[cbind(1:nrow(iris), iris$Species)]))
## [1] -15.18056
-2 * sum(log(irisprob[cbind(1:nrow(iris), iris$Species)]))
## [1] 30.36112
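
Since the question asks about working from cross-tabulations, the same deviance can also be sketched from a table of terminal node by observed class. This is only a minimal sketch that treats the terminal nodes as the profiles; it is not Ritschard's full procedure, which compares against a table built from all distinct predictor profiles. The names tab, prop, dev_tab, and dev_obs are made up for the illustration.

```r
library("partykit")

# Same tree as in example("ctree", package = "partykit")
irisct <- ctree(Species ~ ., data = iris)

# Cross-tabulate terminal node (rows) against observed class (columns)
node <- predict(irisct, type = "node")
tab <- table(node, iris$Species)

# Within-node class proportions, i.e. the tree's predicted probabilities
prop <- prop.table(tab, margin = 1)

# Deviance: -2 * sum over cells of n_ij * log(p_ij);
# empty cells are skipped because they contribute zero
nz <- tab > 0
dev_tab <- -2 * sum(tab[nz] * log(prop[nz]))

# Cross-check against the observation-wise computation
irisprob <- predict(irisct, type = "prob")
idx <- cbind(seq_len(nrow(iris)), as.integer(iris$Species))
dev_obs <- -2 * sum(log(irisprob[idx]))

all.equal(dev_tab, dev_obs)
```

Working from the table makes it straightforward to derive further statistics from the same counts, for example a comparison with the root-node (no split) deviance.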

Again, this can also be obtained by re-fitting as a multinomial model:

library("nnet")
iris$node <- factor(predict(irisct, newdata = iris, type = "node"))
irismultinom <- multinom(Species ~ node, data = iris, trace = FALSE)
deviance(irismultinom)
## [1] 30.36321
logLik(irismultinom)
## 'log Lik.' -15.1816 (df=8)

The small difference from the direct computation (30.36321 vs. 30.36112) is numerical: multinom() is fitted by iterative optimization, and because some nodes contain no observations of certain classes, the fitted probabilities can only approach 0 and 1 rather than reach them exactly.

See also the discussion in https://stats.stackexchange.com/questions/6581/what-is-deviance-specifically-in-cart-rpart for the connections between regression and classification trees and generalized linear models.