R predict.glm when newdata 具有较少的级别
R predict.glm when newdata has fewer levels
我试图向自己证明,当 newdata
的标签和水平(因子水平的基础整数)与训练数据不匹配时,predict()
不会给出错误的预测.
我想我确实证明了这一点,我在下面分享了该代码,但我只想问一下 R 在预测 newdata
时到底在做什么。我知道它不是将 newdata
附加到训练数据,它是否在预测之前将 newdata
的因子标签转换为训练数据的相应表示?
options(stringsAsFactors = TRUE)
dat <- data.frame(x = rep(c("cat", "dog", "bird", "horse"), 100), y = rgamma(100, shape=3, scale = 300))
model <- glm(y~., family = Gamma(link = "log"), data = dat)
coefficients(model)
# (Intercept) xcat xdog xhorse
# 6.5816536 0.2924488 0.3586094 0.2740487
newdata1 <- data.frame(x = "cat")
newdata2 <- data.frame(x = "bird")
newdata3 <- data.frame(x = "dog")
predict.glm(object = model, newdata = newdata1, type = "response")
# 1
# 966.907
exp(6.5816536 + 0.2924488) #intercept + cat coef
# [1] 966.9071
predict.glm(object = model, newdata = newdata2, type = "response")
# 1
# 721.7318
exp(6.5816536)
# [1] 721.7318
predict.glm(object = model, newdata = newdata3, type = "response")
# 1
# 1033.042
exp(6.5816536 + 0.3586094)
# [1] 1033.042
unclass(dat$x)
# [1] 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3
# [87] 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4
# [173] 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3
# [259] 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4
# [345] 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4
# attr(,"levels")
# [1] "bird" "cat" "dog" "horse"
unclass(newdata1$x)
# [1] 1
# attr(,"levels")
# [1] "cat"
unclass(newdata2$x)
# [1] 1
# attr(,"levels")
# [1] "bird"
模型对象有一个 xlevels
记录因子水平用于模型估计。对于您的示例,我们有:
model$xlevels
#$x
#[1] "bird" "cat" "dog" "horse"
当您的新数据出现在预测中时,将匹配因子水平。例如,您的 newdata1
将匹配到 "cat" 个级别,这是 xlevels
中的第二个级别。因此,predict 将毫不费力地找到该级别的正确系数。
我试图向自己证明,当 newdata
的标签和水平(因子水平的基础整数)与训练数据不匹配时,predict()
不会给出错误的预测.
我想我确实证明了这一点,我在下面分享了该代码,但我只想问一下 R 在预测 newdata
时到底在做什么。我知道它不是将 newdata
附加到训练数据,它是否在预测之前将 newdata
的因子标签转换为训练数据的相应表示?
options(stringsAsFactors = TRUE)
dat <- data.frame(x = rep(c("cat", "dog", "bird", "horse"), 100), y = rgamma(100, shape=3, scale = 300))
model <- glm(y~., family = Gamma(link = "log"), data = dat)
coefficients(model)
# (Intercept) xcat xdog xhorse
# 6.5816536 0.2924488 0.3586094 0.2740487
newdata1 <- data.frame(x = "cat")
newdata2 <- data.frame(x = "bird")
newdata3 <- data.frame(x = "dog")
predict.glm(object = model, newdata = newdata1, type = "response")
# 1
# 966.907
exp(6.5816536 + 0.2924488) #intercept + cat coef
# [1] 966.9071
predict.glm(object = model, newdata = newdata2, type = "response")
# 1
# 721.7318
exp(6.5816536)
# [1] 721.7318
predict.glm(object = model, newdata = newdata3, type = "response")
# 1
# 1033.042
exp(6.5816536 + 0.3586094)
# [1] 1033.042
unclass(dat$x)
# [1] 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3
# [87] 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4
# [173] 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3
# [259] 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4
# [345] 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4 2 3 1 4
# attr(,"levels")
# [1] "bird" "cat" "dog" "horse"
unclass(newdata1$x)
# [1] 1
# attr(,"levels")
# [1] "cat"
unclass(newdata2$x)
# [1] 1
# attr(,"levels")
# [1] "bird"
模型对象有一个 xlevels
记录因子水平用于模型估计。对于您的示例,我们有:
model$xlevels
#$x
#[1] "bird" "cat" "dog" "horse"
当您的新数据出现在预测中时,将匹配因子水平。例如,您的 newdata1
将匹配到 "cat" 个级别,这是 xlevels
中的第二个级别。因此,predict 将毫不费力地找到该级别的正确系数。