Avoid failing when a factor has new levels in the test set
I have a dataset that I split into training and test subsets as follows:
train_ind <- sample(seq_len(nrow(dataset)), size=(2/3)*nrow(dataset))
train <- dataset[train_ind, ]
test <- dataset[-train_ind, ]
Then I use it to train a glm:
glm.res <- glm(response ~ ., data=dataset, subset=train_ind, family = binomial(link=logit))
Finally, I use it to predict on my test set:
preds <- predict(glm.res, test, type="response")
Depending on the sample, this fails with the error:
Error in model.frame.default(Terms, newdata, na.action = na.action,
xlev = object$xlevels) :
factor has new levels
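Note that the offending value is present in the full dataset but apparently not in the training set. What I would like is for the predict function to ignore these new levels. Since the factors are binarized anyway, I don't understand why it can't simply assume that the new values (which are therefore not variables in the linear model) are 0, which would yield the correct behavior.
Is there a way to do this?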
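One way to read the "just 0" idea: under R's default treatment contrasts, a row whose dummy columns are all 0 is a row at the factor's reference level, so treating new levels as 0 amounts to predicting as if they were the reference level. The sketch below illustrates that on simulated data. `to_reference()` is a hypothetical helper, not a base R function, and whether the substitution is statistically sensible depends on the problem.

```r
set.seed(1)
n  <- 200
df <- data.frame(
  y  = factor(rbinom(n, size = 1, prob = 0.7)),
  x1 = rnorm(n),
  x2 = cut(runif(n), breaks = seq(0, 1, 0.2))
)

# Hold one x2 level out of the training data on purpose.
train <- droplevels(df[df$x2 != "(0.6,0.8]", ])
test  <- df
fit   <- glm(y ~ ., data = train, family = binomial(link = logit))

# Hypothetical helper: recode levels the model never saw to the reference level.
# Assumes default treatment contrasts, where the first level is the reference.
to_reference <- function(model, newdata) {
  for (v in names(model$xlevels)) {
    lev <- model$xlevels[[v]]
    x   <- as.character(newdata[[v]])
    x[!is.na(x) & !(x %in% lev)] <- lev[1]
    newdata[[v]] <- factor(x, levels = lev)
  }
  newdata
}

preds_ref <- predict(fit, to_reference(fit, test), type = "response")
```

For the recoded rows the linear predictor reduces to the intercept plus the `x1` term, which is exactly the "all dummies are 0" case.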
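I start from the following data-generating process (a binary response variable, one numeric independent variable, and 3 categorical independent variables):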
set.seed(1)
n <- 500
y <- factor(rbinom(n, size=1, prob=0.7))
x1 <- rnorm(n)
x2 <- cut(runif(n), breaks=seq(0,1,0.2))
x3 <- cut(runif(n), breaks=seq(0,1,0.25))
x4 <- cut(runif(n), breaks=seq(0,1,0.1))
df <- data.frame(y, x1, x2, x3, x4)
Here I build the training and test sets so that some categorical covariates (x2 and x3) have more categories in the test set than in the training set:
idx <- which(df$x2!="(0.6,0.8]" & df$x3!="(0,0.25]")
train_ind <- sample(idx, size=(2/3)*length(idx))
train <- df[train_ind,]
train$x2 <- droplevels(train$x2)
train$x3 <- droplevels(train$x3)
test <- df[-train_ind,]
table(train$x2)
(0,0.2] (0.2,0.4] (0.4,0.6] (0.8,1]
55 40 53 49
table(test$x2)
(0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] (0.8,1]
58 48 45 90 62
table(train$x3)
(0.25,0.5] (0.5,0.75] (0.75,1]
66 61 70
table(test$x3)
(0,0.25] (0.25,0.5] (0.5,0.75] (0.75,1]
131 63 47 62
Of course, predict produces the error described by @Setzer22 above:
glm.res <- glm(y ~ ., data=train, family = binomial(link=logit))
preds <- predict(glm.res, test, type="response")
Error in model.frame.default(Terms, newdata, na.action = na.action,
xlev = object$xlevels) : factor x2 has new levels (0.6,0.8]
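The error arises because the fitted model stores the factor levels it was trained on in `xlevels`, and `predict` refuses any level outside that set. A minimal sketch for listing the offending levels before calling `predict` follows. It uses its own small simulated dataset, and `new_levels()` is a hypothetical helper, not part of base R.

```r
set.seed(1)
n  <- 200
df <- data.frame(
  y  = factor(rbinom(n, size = 1, prob = 0.7)),
  x1 = rnorm(n),
  x2 = cut(runif(n), breaks = seq(0, 1, 0.2))
)

# Hold one x2 level out of the training data on purpose.
train <- droplevels(df[df$x2 != "(0.6,0.8]", ])
test  <- df
fit   <- glm(y ~ ., data = train, family = binomial(link = logit))

# Hypothetical helper: for each factor in the model, return the values in
# newdata that are not among the levels stored at fitting time.
new_levels <- function(model, newdata) {
  seen <- model$xlevels  # named list: factor name -> training levels
  out  <- lapply(names(seen), function(v) {
    setdiff(unique(as.character(newdata[[v]])), seen[[v]])
  })
  names(out) <- names(seen)
  out[lengths(out) > 0]  # keep only factors that actually have new levels
}

new_levels(fit, test)
```

An empty list means `predict` should not raise the "new levels" error for that data.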
Here is an (inelegant) way to drop the rows of test whose covariates have levels that never occur in train:
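dropcats <- function(k) {
  # Returns a logical vector: TRUE for rows of test whose value in column k also occurs in train
  xtst <- test[,k]
  xtrn <- train[,k]
  # For each distinct test value, was it also observed in train?
  cmp.tst.trn <- (unique(xtst) %in% unique(xtrn))
  if (is.factor(xtst) && any(!cmp.tst.trn)) {
    cat.tst <- unique(xtst)
    # Compare every test row against each shared category; keep rows that match any of them
    apply(test[,k]==matrix(rep(cat.tst[cmp.tst.trn],each=nrow(test)),
                           nrow=nrow(test)),1,any)
  } else {
    # Non-factor column, or no new levels: keep every row
    rep(TRUE,nrow(test))
  }
}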
filt <- apply(sapply(2:ncol(df),dropcats),1,all)
subset.test <- test[filt,]
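The same row filter can probably be written more compactly with `%in%` against the levels stored in the fitted model. The sketch below is one possible variant on its own small simulated dataset. `keep_seen()` is a hypothetical helper, and it assumes the model was fitted on data whose unused levels had been dropped, as above.

```r
set.seed(1)
n  <- 200
df <- data.frame(
  y  = factor(rbinom(n, size = 1, prob = 0.7)),
  x1 = rnorm(n),
  x2 = cut(runif(n), breaks = seq(0, 1, 0.2)),
  x3 = cut(runif(n), breaks = seq(0, 1, 0.25))
)

# Hold one level of x2 and one level of x3 out of the training data.
train <- droplevels(df[df$x2 != "(0.6,0.8]" & df$x3 != "(0,0.25]", ])
test  <- df
fit   <- glm(y ~ ., data = train, family = binomial(link = logit))

# Hypothetical helper: TRUE for rows of newdata whose factor values all occur
# among the levels the model was fitted with. Rows with NA in a factor get FALSE.
keep_seen <- function(model, newdata) {
  ok <- rep(TRUE, nrow(newdata))
  for (v in names(model$xlevels)) {
    ok <- ok & (as.character(newdata[[v]]) %in% model$xlevels[[v]])
  }
  ok
}

filt2  <- keep_seen(fit, test)
preds2 <- predict(fit, test[filt2, ], type = "response")
```

Because the helper reads the levels from the model rather than from `train`, it can be used even when the training data is no longer available.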
In subset.test, the filtered subset of the test set, x2 and x3 have no new categories:
table(subset.test[,"x2"])
(0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] (0.8,1]
26 25 20 0 28
table(subset.test[,"x3"])
(0,0.25] (0.25,0.5] (0.5,0.75] (0.75,1]
0 29 29 41
Now predict runs fine:
preds <- predict(glm.res, subset(test,filt), type="response")
head(preds)
30 39 41 49 55 56
0.7732564 0.8361226 0.7576259 0.5589563 0.8965357 0.8058025
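Dropping rows means the predictions no longer line up one-to-one with the original test set. If that alignment matters, an alternative is to recode unseen levels as NA, so that `predict` (whose default `na.action` is `na.pass`) returns NA for those rows instead of failing. This is a sketch on its own small simulated dataset, not part of the approach above.

```r
set.seed(1)
n  <- 200
df <- data.frame(
  y  = factor(rbinom(n, size = 1, prob = 0.7)),
  x1 = rnorm(n),
  x2 = cut(runif(n), breaks = seq(0, 1, 0.2))
)

# Hold one x2 level out of the training data on purpose.
train <- droplevels(df[df$x2 != "(0.6,0.8]", ])
test  <- df
fit   <- glm(y ~ ., data = train, family = binomial(link = logit))

# Recode every factor value the model never saw to NA.
test_na <- test
for (v in names(fit$xlevels)) {
  x <- as.character(test_na[[v]])
  x[!(x %in% fit$xlevels[[v]])] <- NA
  test_na[[v]] <- factor(x, levels = fit$xlevels[[v]])
}

# One prediction per test row; rows with an unseen level come back as NA.
preds_na <- predict(fit, test_na, type = "response")
```

The NA entries then mark exactly the rows the model cannot score, which can be handled separately downstream.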
Hope this helps.