Dummies in Lasso Regression in R

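I have a dataset with 690 observations that contains both categorical and numeric variables. I want to fit a lasso regression, but the cross-validation curve is not smooth when I plot it, and I wonder whether the dummy variables or something else is causing the problem. I reproduced an example dataset: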

num1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
cat1 = c("a", "b", "a", "a", "b", "a", "b", "a", "a", "b")
cat2 = c("gg", "uu", "t", "t", "t", "uu", "uu", "gg", "t", "t") 
x=c(0, 0, 1, 1, 0, 0, 0, 1, 1, 0)
ex = data.frame(num1, cat1, cat2, x)

Here is the code:

library(fastDummies)
ex <- dummy_cols(ex, select_columns = c("cat1", "cat2"), remove_first_dummy = TRUE)


xxx <- ex[,1:3]
yyy <- ex$x
unique(yyy)

xxx <- data.matrix(xxx)

library(glmnet)
set.seed(999)
mod.lasso <- cv.glmnet(xxx, yyy, 
                         family='binomial', alpha=1, 
                         parallel=TRUE, standardize=TRUE, type.measure='auc')

Here is my plot:

If you look at the output:

library(fastDummies)
ex = data.frame(num1, cat1, cat2, x)
ex <- dummy_cols(ex, select_columns = c("cat1", "cat2"), remove_first_dummy = TRUE)

head(ex)

  num1 cat1 cat2 x cat1_b cat2_t cat2_uu
1    1    a   gg 0      0      0       0
2    2    b   uu 0      1      0       1
3    3    a    t 1      0      1       0
4    4    a    t 1      0      1       0
5    5    b    t 0      1      1       0
6    6    a   uu 0      0      0       1

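What you actually need are cat1_b, cat2_t and cat2_uu, i.e. your categorical columns converted to binary indicators (together with the numeric column num1). Taking the first three columns with ex[,1:3] is wrong: that selects num1 and the original cat1 and cat2, and data.matrix() then converts those categorical columns into arbitrary integer codes.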
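
To make the problem concrete, here is a minimal sketch in base R showing what data.matrix() does to a categorical column. The small data frame `toy` is made up for illustration and only mirrors the structure of the question's example. Each category is replaced by an integer level code, which the lasso would then treat as an ordered numeric quantity.

```r
# Hypothetical toy data, mirroring the structure of the question's example
toy <- data.frame(num1 = 1:4,
                  cat2 = c("gg", "uu", "t", "t"))

# data.matrix() turns character/factor columns into integer level codes
# (levels are sorted alphabetically here: gg = 1, t = 2, uu = 3)
coded <- data.matrix(toy)
print(coded)
```

The cat2 column comes out as 1, 3, 2, 2, so "uu" is silently treated as "three times gg", which is not a meaningful encoding for an unordered category.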

So we can do this:

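ex = data.frame(num1, cat1, cat2, x)
# Keep only the predictors (drop the response x) and replace cat1/cat2 with dummy columns
xxx = dummy_cols(ex[, c("num1", "cat1", "cat2")], select_columns = c("cat1", "cat2"),
                 remove_first_dummy = TRUE, remove_selected_columns = TRUE)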
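
As an alternative to fastDummies, the same design matrix can be sketched with base R's model.matrix(). This is not what the code above uses; it is just another common way to build dummies, shown here on the question's toy data. Note that the dummy columns get slightly different names (cat1b rather than cat1_b).

```r
# Same toy data as in the question
num1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
cat1 <- c("a", "b", "a", "a", "b", "a", "b", "a", "a", "b")
cat2 <- c("gg", "uu", "t", "t", "t", "uu", "uu", "gg", "t", "t")
x    <- c(0, 0, 1, 1, 0, 0, 0, 1, 1, 0)
ex   <- data.frame(num1, cat1, cat2, x)

# model.matrix() expands categorical columns into 0/1 dummies, dropping the
# first level of each; [, -1] removes the intercept column.
# The response x is deliberately left out of the formula.
X <- model.matrix(~ num1 + cat1 + cat2, data = ex)[, -1]
y <- ex$x

colnames(X)
```

X is already a numeric matrix, so it can be passed together with y to cv.glmnet(X, y, family = "binomial") without a further data.matrix() call.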

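As for the AUC curve: you have very little data and only 15 variables, so the curve may be somewhat unstable. One way to think about it is that as lambda gets larger (towards the right of the plot), fewer coefficients remain non-zero and the estimated AUC becomes less stable. Try it again on your full dataset and see whether the curve changes.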
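
To illustrate why a small sample gives a jumpy AUC, here is a rough simulation sketch in base R. Everything in it is an assumption made for illustration: the scores are synthetic (two classes separated by a mean shift of 1), `auc` and `sim_auc` are hypothetical helper names, and no lasso is fitted. It only compares how much an AUC estimate varies across repeated samples of 10 versus 690 observations.

```r
# Rank-based (Mann-Whitney) AUC for numeric scores and 0/1 labels
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Simulate one AUC estimate from n observations (half per class);
# the classes differ by a mean shift of 1 in the score (an arbitrary choice)
sim_auc <- function(n) {
  label <- rep(0:1, each = n / 2)
  score <- label + rnorm(n)
  auc(score, label)
}

set.seed(1)
auc_small <- replicate(500, sim_auc(10))   # 10 observations, like the toy data
auc_large <- replicate(500, sim_auc(690))  # 690 observations, like the full data

# Spread of the AUC estimate across repeated samples
c(sd_small = sd(auc_small), sd_large = sd(auc_large))
```

The standard deviation for the 10-observation case should come out much larger than for the 690-observation case, which is the kind of noise that makes the cross-validated curve look ragged on the toy example.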

Below I use an example dataset, and you can see that it works fine with dummy variables:

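library(glmnet)
library(fastDummies)

coln = c('age','workclass','fnlwgt','edu','edu_num','maritial','occ','relationship','race','sex','capital-gain','capital-loss','hours-per-week','country','label')

# adult.data has no header row. Strings are read as factors so that the
# is.factor() and as.numeric(label) steps below work (R >= 4.0 defaults to stringsAsFactors = FALSE).
df = read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
              header = FALSE, col.names = coln, na.strings = " ?", stringsAsFactors = TRUE)
df = df[complete.cases(df),]

# Names of the categorical predictors (the label column is excluded)
sel = names(which(sapply(df[,-ncol(df)], is.factor)))
set.seed(999)                  # make the subsample reproducible
idx = sample(nrow(df), 2000)   # random subsample of 2000 rows

X = dummy_cols(df[,-ncol(df)], select_columns = sel,
               remove_selected_columns = TRUE)[idx,]

# label is a two-level factor; convert its codes 1/2 to 0/1
Y = as.numeric(df$label)[idx] - 1

fit = cv.glmnet(x = as.matrix(X), y = Y, family = "binomial", type.measure = "auc")
plot(fit)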