How to test a logistic regression model in R?

Question

I'm building a CTR prediction model for a Kaggle competition (link). I read the first 100,000 rows of the training set and then split them further into 80/20 train/test sets with:

library(caret)  # provides createDataPartition()
ad_data <- read.csv("train", header = TRUE, stringsAsFactors = FALSE, nrows = 100000)
trainIndex <- createDataPartition(ad_data$click, p=0.8, list=FALSE, times=1)
ad_train <- ad_data[trainIndex,]
ad_test <- ad_data[-trainIndex,]

I then fit a GLM on the ad_train data:

ad_glm_model <- glm(ad_train$clicks ~ ad_train$C1 + ad_train$site_category + ad_train$device_type, family = binomial(link = "logit"), data = ad_train)

But whenever I try to use the predict function to check how well it does on the ad_test set, I get this warning:

test_model <- predict(ad_glm_model, newdata = ad_test, type = "response")
Warning message:
'newdata' had 20000 rows but variables found have 80000 rows

What gives? How can I test my GLM model on new data?

EDIT: Works perfectly. I just needed to use this call instead:

ad_glm_model <- glm(clicks ~ C1 + site_category + device_type, family = binomial(link = "logit"), data = ad_train)

Answer 1

This happens because you are including the data frame name with every variable in the model formula. Instead, your formula should be:

glm(clicks ~ C1 + site_category + device_type, family = binomial(link = "logit"), data = ad_train)
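For reference, here is a minimal sketch of the whole fit-then-test loop with that formula. It runs on made-up data, not the Kaggle file: the column names mirror the question, the values are random, and the split uses base R's sample() in place of caret. The two metrics at the end (accuracy at a 0.5 cutoff and log loss) are just common choices for a binary outcome, not something the question prescribes.

```r
# Synthetic stand-in for the ad data (assumption: values are random,
# only the column names are taken from the question).
set.seed(42)
n <- 1000
ad_data <- data.frame(
  clicks        = rbinom(n, 1, 0.2),
  C1            = sample(c("1005", "1002", "1010"), n, replace = TRUE),
  site_category = sample(c("a", "b", "c"), n, replace = TRUE),
  device_type   = sample(0:1, n, replace = TRUE)
)

# Plain 80/20 split with base R (the question uses caret::createDataPartition).
train_idx <- sample(seq_len(n), size = 0.8 * n)
ad_train  <- ad_data[train_idx, ]
ad_test   <- ad_data[-train_idx, ]

# Bare column names in the formula; the data frame is supplied via `data`.
ad_glm_model <- glm(clicks ~ C1 + site_category + device_type,
                    family = binomial(link = "logit"), data = ad_train)

# One predicted probability per row of the test set.
test_prob <- predict(ad_glm_model, newdata = ad_test, type = "response")

# Two simple checks on held-out data: accuracy at a 0.5 cutoff and log loss.
accuracy <- mean((test_prob > 0.5) == (ad_test$clicks == 1))
log_loss <- -mean(ad_test$clicks * log(test_prob) +
                  (1 - ad_test$clicks) * log(1 - test_prob))
```

With the bare-name formula, length(test_prob) matches nrow(ad_test), which is exactly what the original call failed to deliver.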

As stated in the second link in the duplicate notice:

This is a problem of using different names between your data and your newdata and not a problem between using vectors or dataframes.

When you fit a model with the lm function and then use predict to make predictions, predict tries to find the same names on your newdata. In your first case name x conflicts with mtcars$wt and hence you get the warning.
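The name lookup described in that quote can be seen on a small toy example. This is only a sketch: the data frame df and the names train, test, bad, and good are made up for illustration and do not come from the question.

```r
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- rbinom(100, 1, plogis(df$x))
train <- df[1:80, ]
test  <- df[81:100, ]

# Formula written with `train$`: the model's variables are literally
# named "train$y" and "train$x".
bad  <- glm(train$y ~ train$x, family = binomial, data = train)
# Formula written with bare names: the variables are "y" and "x".
good <- glm(y ~ x, family = binomial, data = train)

# These are the names predict() will look for in newdata.
names(model.frame(bad))   # "train$y" "train$x"
names(model.frame(good))  # "y" "x"

# `test` has no column called "train$x", so R evaluates train$x itself and
# returns 80 values (with the warning from the question) instead of 20.
p_bad  <- suppressWarnings(predict(bad, newdata = test, type = "response"))
p_good <- predict(good, newdata = test, type = "response")

length(p_bad)   # 80
length(p_good)  # 20
```

Note that the first predict() call only warns and still returns a vector, so the result can look plausible even though it is a set of predictions for the training rows.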
