Error when using glmnet on binomial data

I imported some data as follows:

library(dplyr)   # provides select()

surv <- read.table("http://www.stat.ufl.edu/~aa/glm/data/Student_survey.dat", header = TRUE)
x <- as.matrix(select(surv, -ab))
y <- as.matrix(select(surv, ab))
glmnet::cv.glmnet(x, y, alpha = 1, family = "binomial", type.measure = "auc")

I get the following error:

NAs introduced by coercion
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : NA/NaN/Inf in foreign function call (arg 5)

Is there a good way to fix this?

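The glmnet package documentation has the information you need: x must be a numeric matrix, so the factor columns have to be converted before fitting.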

surv <- read.table("http://www.stat.ufl.edu/~aa/glm/data/Student_survey.dat", header = TRUE, stringsAsFactors = TRUE)

x <- surv[, colnames(surv) != 'ab']      # drop the response column 'ab'
y <- surv[, 'ab']                        # the 'binomial' family also accepts a two-level factor

xfact <- sapply(x, is.factor)            # TRUE for factor columns, FALSE for numeric ones

# one option is to build dummy variables from the factors
# (the other option is to convert them to numeric codes)
xfactCols <- model.matrix(~ . - 1, data = x[, xfact, drop = FALSE])

xall <- as.matrix(cbind(x[, !xfact, drop = FALSE], xfactCols))   # combine numeric and dummy columns

fit <- glmnet::cv.glmnet(xall, y, alpha = 1, family = "binomial", type.measure = "auc")   # runs without the coercion error

str(fit)
List of 10
 $ lambda    : num [1:89] 0.222 0.202 0.184 0.168 0.153 ...
 $ cvm       : num [1:89] 1.12 1.11 1.1 1.07 1.04 ...
 $ cvsd      : num [1:89] 0.211 0.212 0.211 0.196 0.183 ...
 $ cvup      : num [1:89] 1.33 1.32 1.31 1.27 1.23 ...
 $ cvlo      : num [1:89] 0.908 0.9 0.89 0.874 0.862 ...
 $ nzero     : Named int [1:89] 0 2 2 3 3 3 4 4 5 6 ...
 .....
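To see why the original call failed, here is a minimal sketch on a made-up data frame (the `toy` data and its column names are hypothetical, not the survey data). It only illustrates that `as.matrix()` on mixed columns yields a character matrix, which cannot be coerced to numbers:

```r
# Hypothetical toy data: one numeric and one character column
toy <- data.frame(age = c(21, 35, 42), sex = c("m", "f", "m"))

m <- as.matrix(toy)        # mixed column types collapse to a character matrix
is_char <- is.character(m)

# Coercing the text column to numeric produces NA values,
# with the warning "NAs introduced by coercion"
coerced <- suppressWarnings(as.numeric(m[, "sex"]))

print(is_char)
print(coerced)
```

Those NA values are what the fitting routine then rejects, which is why the predictors need to be dummy-coded (or otherwise made numeric) first.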

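I ran into the same problem with a mix of numeric and character/factor data types. To convert the predictors, I recommend the function that ships with the glmnet package for exactly this mixed-type problem: glmnet::makeX(). It handles dummy-variable creation and can even perform simple imputation when data are missing.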

x <- glmnet::makeX(surv[, -which(colnames(surv) == 'ab')])

Or, in a more tidy-ish style:

library(tidyverse)

x <- 
  surv %>%
  select(-ab) %>%
  glmnet::makeX()
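As a rough illustration of the imputation mentioned above, the sketch below calls makeX() with its `na.impute` argument on a small made-up data frame (`toy` and its columns are hypothetical). It assumes the glmnet package is installed:

```r
library(glmnet)

# Hypothetical toy data with missing values in both column types
toy <- data.frame(
  age = c(21, NA, 42, 30),
  sex = factor(c("m", "f", NA, "f"))
)

# Dummy-code the factor and fill missing entries with column means
x_imp <- glmnet::makeX(toy, na.impute = TRUE)

print(dim(x_imp))
print(anyNA(x_imp))
```

Mean imputation is a simple default; whether it is appropriate depends on how the values came to be missing in your data.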