Error when replacing new factor levels in test dataset with `NA`

Question

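I have split my data set into a test set and a training set. I am trying to fit a regression on the training set and then use predict on the test set. When I do this, I get the error message "Error in model.frame factor x has New Levels". I understand that this happens because my test data contains some levels that do not appear in my training data.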

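All I want to do is drop or ignore the levels that are not present in both data sets. I tried the following, but it does not set any levels to NA, and the id object shows "integer (empty)":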

id <- which(!(test$x %in% levels(train$x)))
train$x[id] <- NA

fit <- lm(y ~ x, data=train)
P <- predict(fit,test)

Answer 1

I would expect your code to fail with a "replacement length differs" error.

id <- which(!(test$x %in% levels(train$x)))

tells you which elements of test$x are not in levels(train$x), so when you do the replacement you should use id to index test$x, not train$x.

test$x[id] <- NA
test$x <- droplevels(test$x)  ## also don't forget to remove unused factor levels

fit <- lm(y ~ x, data = train)
P <- predict(fit, test)

All of the data in train will be used to fit your linear regression model. Some of the predictions in P will be NA.
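To make the effect concrete, here is a minimal sketch on made-up data. The values of x and y and the level names "a" to "e" are assumptions chosen purely for illustration; they are not the asker's data.

```r
# Toy data (hypothetical): the training set has levels a, b, c
train <- data.frame(
  x = factor(c("a", "b", "c", "a", "b", "c")),
  y = c(1.0, 2.1, 2.9, 1.2, 1.9, 3.1)
)
# The test set contains two levels, d and e, that train never saw
test <- data.frame(x = factor(c("a", "d", "c", "e")))

# Positions in test$x whose value is not a level of train$x
id <- which(!(test$x %in% levels(train$x)))

# Blank out the unseen values in the TEST set, then drop the now-unused levels
test$x[id] <- NA
test$x <- droplevels(test$x)

fit <- lm(y ~ x, data = train)
P <- predict(fit, test)   # rows 2 and 4 have x = NA, so their predictions are NA
```

Here id picks out rows 2 and 4 of test, and those are the rows whose prediction comes back as NA; the remaining rows are predicted as usual.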

I'm still unable to get the id object to correctly identify which levels are not in both data sets. In the work-space it just shows integer(0).

Then what is the point of your question??!! All levels of test$x are within levels(train$x); there are no new levels.
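One possible explanation for seeing integer(0) together with the predict error, which is not confirmed anywhere in this thread, is that train and test were cut from a single data frame. Subsetting keeps the full set of factor levels, so levels(train$x) can list levels that never actually occur in train. lm() drops such unused levels when fitting and records the ones it used in fit$xlevels, so comparing against fit$xlevels$x may be the more reliable check. The sketch below uses made-up data; full, id_levels, id_fit and err are hypothetical names.

```r
# Hypothetical full data set; level "c" occurs only in the last row
full <- data.frame(
  x = factor(c("a", "a", "b", "b", "c")),
  y = c(1.0, 1.2, 2.0, 2.2, 3.0)
)
train <- full[1:4, ]               # "c" never occurs here...
test  <- full[5, , drop = FALSE]   # ...but it does occur here

# levels(train$x) still lists "c", so this check finds nothing
id_levels <- which(!(test$x %in% levels(train$x)))   # integer(0)

fit <- lm(y ~ x, data = train)

# predict() fails because the fitted model only knows levels a and b
err <- tryCatch(predict(fit, test), error = function(e) conditionMessage(e))

# Compare against the levels the model actually used
id_fit <- which(!(test$x %in% fit$xlevels$x))        # finds row 1

test$x[id_fit] <- NA
P <- predict(fit, test)   # the single test row now yields NA instead of an error
```

Calling droplevels(train$x) before building id would have a similar effect, since levels(train$x) would then match the values that are really present in the training data.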

regression

r

levels

linear-regression

predict