`predict` error while doing n-fold cross-validation for my GLM

Question

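I am using this function to do n-fold cross-validation. The misclassification rate does not change with the number of folds, e.g. whether I run 10 or 50. I also get a warning: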

"Warning message:

'newdata' had 19 rows but variables found have 189 rows"

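If I run the code outside of a function, it does what I want: e.g. for folds == 1 it pulls out 10% of the data, fits the model on the remaining 90%, and predicts the held-out 10%. Does anyone know why the result shows no variation with the variables or the number of folds?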

library("MASS")  
data(birthwt)
data=birthwt

n.folds=10

jim = function(x,y,n.folds,data){

  for(i in 1:n.folds){
    folds <- cut(seq(1,nrow(data)),breaks=n.folds,labels=FALSE)      
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- data[testIndexes, ]
    trainData <- data[-testIndexes, ]
    glm.train <- glm(y ~ x, family = binomial, data=trainData)
    predictions=predict(glm.train, newdata =testData, type='response')
    pred.class=ifelse(predictions< 0, 0, 1)
    }

  rate=sum(pred.class!= y) / length(y)
  print(head(rate))
  }

jim(birthwt$smoke, birthwt$low, 10, birthwt)

Answer 1

I am now turning my comments into an answer.

jim <- function(x, y, n.folds, data) {   

  pred.class <- numeric(0)  ## initially empty; accumulated later
  for(i in 1:n.folds){
    folds <- cut(seq(1,nrow(data)), breaks = n.folds, labels = FALSE)  
    testIndexes <- which(folds == i)  ## no need for `arr.ind = TRUE`
    testData <- data[testIndexes, ]
    trainData <- data[-testIndexes, ]
    ## `reformulate` constructs formula from strings. Read `?reformulate`
    glm.train <- glm(reformulate(x, y), family = binomial, data = trainData)
    predictions <- predict(glm.train, newdata = testData, type = 'response')
    ## accumulate the result using `c()`
    ## change `predictions < 0` to `predictions < 0.5` as `type = response`
    pred.class <- c(pred.class, ifelse(predictions < 0.5, 0, 1))
    }

  ## to access a column with string, use `[[]]` not `$`
  rate <- sum(pred.class!= data[[y]]) / length(data[[y]])
  rate  ## or `return(rate)`
  }

jim("smoke", "low", 10, birthwt)
# [1] 0.3121693

Remarks:

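There is no need for arr.ind = TRUE here, although it has no side effect.
Your classification is wrong. You set type = "response", so predict returns probabilities between 0 and 1, and ifelse(predictions < 0, 0, 1) then assigns every observation to class 1. Use 0.5 as the cutoff when computing pred.class.
Each iteration of the for loop overwrites pred.class. I think you want to accumulate the results: pred.class <- c(pred.class, ifelse(predictions < 0.5, 0, 1)).
glm and predict are used incorrectly. Putting $ in a model formula is wrong. Please read "Predict() - Maybe I'm not understanding it". Here I changed your function to accept variable names (as strings) and to build a proper model formula for glm. Note that this change requires replacing y with data[[y]] in rate = sum(pred.class != y) / length(y).
You probably want to return rate rather than only print it to the screen, so replace the print call with an explicit return(rate) or an implicit rate.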
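The warning from the question can be reproduced in isolation. The following is a minimal sketch, assuming the birthwt data from MASS and an arbitrary 19-row hold-out; the names train, test, fit_vec and fit_col are illustrative and not part of the original code. It shows that when the formula refers to vectors that are not columns of the data frame, both glm and predict fall back to the full-length vectors.

```r
library(MASS)  # provides the birthwt data set

x <- birthwt$smoke
y <- birthwt$low
test  <- birthwt[1:19, ]     # arbitrary hold-out, for illustration only
train <- birthwt[-(1:19), ]

# `x` and `y` are not columns of `train`, so glm() picks up the
# full-length vectors from the calling environment and the subset
# passed via `data` is effectively ignored.
fit_vec <- glm(y ~ x, family = binomial, data = train)
nobs(fit_vec)    # 189 observations, not 170

# predict() cannot find `x` in `test` either, so it predicts for all
# 189 values and warns that 'newdata' had 19 rows.
p_vec <- suppressWarnings(
  predict(fit_vec, newdata = test, type = "response")
)
length(p_vec)    # 189

# With column names in the formula, both calls use the data frames.
fit_col <- glm(low ~ smoke, family = binomial, data = train)
p_col <- predict(fit_col, newdata = test, type = "response")
length(p_col)    # 19
```

This is why the model has to be specified through column names, for example with reformulate as in the function above.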

You can replace ifelse(predictions < 0.5, 0, 1) with as.integer(predictions >= 0.5), although I did not change that above.
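As a variation on the function above, here is a sketch that assigns the folds once outside the loop, optionally shuffles them, and writes each fold's predictions back to their row positions. cv_error and its shuffle argument are hypothetical names, not part of the original function, and the seed is arbitrary. Shuffling may matter here because the rows of birthwt in MASS appear to be sorted by low, so contiguous folds can hold out rows of a single class.

```r
library(MASS)  # provides the birthwt data set

# Hypothetical helper, not from the original answer.
cv_error <- function(x, y, n.folds, data, shuffle = TRUE) {
  # Assign every row to a fold once, outside the loop.
  folds <- cut(seq_len(nrow(data)), breaks = n.folds, labels = FALSE)
  if (shuffle) folds <- sample(folds)  # random fold membership

  pred.class <- integer(nrow(data))
  for (i in seq_len(n.folds)) {
    testIndexes <- which(folds == i)
    fit <- glm(reformulate(x, y), family = binomial,
               data = data[-testIndexes, ])
    p <- predict(fit, newdata = data[testIndexes, ], type = "response")
    # Store predictions at their row positions so they stay aligned
    # with data[[y]] even when the folds are shuffled.
    pred.class[testIndexes] <- as.integer(p >= 0.5)
  }
  mean(pred.class != data[[y]])  # misclassification rate
}

set.seed(1)  # arbitrary seed, only for reproducibility
cv_error("smoke", "low", 10, birthwt)
cv_error("smoke", "low", 50, birthwt)
```

Note that 0.3121693 equals 59/189, the share of rows with low == 1, which is the error one gets when every row is predicted as 0. With smoke as the only predictor, the estimated error may therefore barely move with the number of folds.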
