Dummies in Lasso Regression in R
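I have a dataset with 690 observations that contains both categorical and numerical variables. I want to fit a lasso regression, but the lasso curve is not smooth when I plot it, and I wonder whether the problem lies with the dummies or with something else.
I reproduced an example dataset: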
num1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
cat1 = c("a", "b", "a", "a", "b", "a", "b", "a", "a", "b")
cat2 = c("gg", "uu", "t", "t", "t", "uu", "uu", "gg", "t", "t")
x=c(0, 0, 1, 1, 0, 0, 0, 1, 1, 0)
ex = data.frame(num1, cat1, cat2, x)
Here is the code:
library(fastDummies)
ex <- dummy_cols(ex, select_columns = c("cat1", "cat2"), remove_first_dummy = TRUE)
xxx <- ex[,1:3]
yyy <- ex$x
unique(yyy)
xxx <- data.matrix(xxx)
library(glmnet)
set.seed(999)
mod.lasso <- cv.glmnet(xxx, yyy,
                       family='binomial', alpha=1,
                       parallel=TRUE, standardize=TRUE, type.measure='auc')
Here you can see my plot:
If you look at the output:
library(fastDummies)
ex = data.frame(num1, cat1, cat2, x)
ex <- dummy_cols(ex, select_columns = c("cat1", "cat2"), remove_first_dummy = TRUE)
head(ex)
  num1 cat1 cat2 x cat1_b cat2_t cat2_uu
1    1    a   gg 0      0      0       0
2    2    b   uu 0      1      0       1
3    3    a    t 1      0      1       0
4    4    a    t 1      0      1       0
5    5    b    t 0      1      1       0
6    6    a   uu 0      0      0       1
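What you actually need are cat1_b, cat2_t and cat2_uu, which are your categorical columns converted to binary indicators. Taking the first three columns is wrong: that selects num1, cat1 and cat2, and data.matrix then converts the categorical columns (factors) to numeric codes.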
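To see concretely what goes wrong, the sketch below reruns the `ex[,1:3]` selection from the question using base R only. It relies on the documented behavior of `data.matrix`, which turns character and factor columns into integer level codes. The variable name `bad` is introduced here purely for illustration.

```r
num1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
cat1 <- c("a", "b", "a", "a", "b", "a", "b", "a", "a", "b")
cat2 <- c("gg", "uu", "t", "t", "t", "uu", "uu", "gg", "t", "t")
x <- c(0, 0, 1, 1, 0, 0, 0, 1, 1, 0)
ex <- data.frame(num1, cat1, cat2, x)

# Same selection as in the question: num1, cat1, cat2 (no dummy columns).
bad <- data.matrix(ex[, 1:3])

# cat1 and cat2 are now integer level codes, not 0/1 indicators.
print(head(bad))
print(sort(unique(bad[, "cat2"])))
```

With codes like these the lasso treats cat2 as a single numeric predictor with an implied ordering of its levels, which is not what a categorical variable means.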
So we can do it like this:
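ex = data.frame(num1, cat1, cat2, x)
# dummy-code the predictors only, so the response x stays out of the design matrix
xxx = dummy_cols(ex[, c("num1", "cat1", "cat2")], select_columns = c("cat1", "cat2"), remove_first_dummy = TRUE, remove_selected_columns = TRUE)
xxx = as.matrix(xxx)  # numeric matrix with columns num1, cat1_b, cat2_t, cat2_uu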
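As a cross-check that does not depend on fastDummies, a similar design matrix can be sketched with base R's `model.matrix`. This is an alternative, not the approach used in the answer; the names `pred` and `mm` are illustrative, and the dummy columns come out named `cat1b`, `cat2t` and `cat2uu` rather than `cat1_b` and so on.

```r
num1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
cat1 <- c("a", "b", "a", "a", "b", "a", "b", "a", "a", "b")
cat2 <- c("gg", "uu", "t", "t", "t", "uu", "uu", "gg", "t", "t")

# Predictors only; the categorical columns are stored as factors.
pred <- data.frame(num1, cat1, cat2, stringsAsFactors = TRUE)

# Default treatment contrasts drop the first level of each factor,
# which mirrors remove_first_dummy = TRUE. Column 1 is the intercept.
mm <- model.matrix(~ ., data = pred)[, -1]

print(colnames(mm))
print(head(mm))
```

`mm` is a plain numeric matrix, so it could be passed as the `x` argument of `cv.glmnet` in place of `xxx`.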
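As for the AUC curve, you have very little data, only 15 variables, so it can be a bit unstable. You can think of it this way: once lambda gets larger (towards the right of the plot), you have fewer non-zero coefficients and the estimate becomes less stable. You can try again on your full dataset and see whether it changes.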
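The effect of sample size on AUC stability can be illustrated without glmnet. The following is a toy simulation under assumed conditions (one noisy score, labels drawn with probability 0.4; the helper names `auc` and `sim_auc` are made up here), not a reproduction of the cross-validation inside `cv.glmnet`. It compares how much the AUC varies across repeated samples of 10 versus 690 observations.

```r
# AUC via the rank (Mann-Whitney) formula; NA if only one class is present.
auc <- function(score, y) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  if (n1 == 0 || n0 == 0) return(NA_real_)
  (sum(rank(score)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Draw n observations whose score is informative but noisy, return the AUC.
sim_auc <- function(n) {
  y <- rbinom(n, 1, 0.4)
  score <- y + rnorm(n)
  auc(score, y)
}

set.seed(999)
auc_small <- replicate(500, sim_auc(10))   # sample size of the toy example
auc_large <- replicate(500, sim_auc(690))  # sample size of the full dataset

# Spread of the AUC estimate at each sample size.
sd_small <- sd(auc_small, na.rm = TRUE)
sd_large <- sd(auc_large, na.rm = TRUE)
print(c(n10 = sd_small, n690 = sd_large))
```

The standard deviation at n = 10 should come out several times larger than at n = 690, which is why a curve estimated from a handful of observations tends to look jagged.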
Below I use an example dataset, and you can see that it works fine with dummy variables:
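library(fastDummies)
library(glmnet)
coln = c('age','workclass','fnlwgt','edu','edu_num','maritial','occ','relationship','race','sex','capital-gain','capital-loss','hours-per-week','country','label')
# adult.data has no header row; stringsAsFactors = TRUE is needed on R >= 4.0 for is.factor() below
df = read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", header = FALSE, col.names = coln, na.strings = " ?", stringsAsFactors = TRUE)
df = df[complete.cases(df),]
# names of the categorical predictors (every column except the label)
sel = names(which(sapply(df[,-ncol(df)], is.factor)))
set.seed(999)  # make the random subsample and the CV folds reproducible
idx = sample(nrow(df), 2000)
X = dummy_cols(df[,-ncol(df)], select_columns = sel,
               remove_selected_columns = TRUE)[idx,]
Y = as.numeric(df$label)[idx] - 1  # factor codes 1/2 -> 0/1
fit = cv.glmnet(x = as.matrix(X), y = Y, family = "binomial", type.measure = "auc")
plot(fit)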