How to test a logistic regression model in R?

I am building a CTR prediction model for a Kaggle competition (link). I read the first 100,000 rows of the training set and then split them 80/20 into train and test sets with:
library(caret)  # provides createDataPartition()
ad_data <- read.csv("train", header = TRUE, stringsAsFactors = FALSE, nrows = 100000)
trainIndex <- createDataPartition(ad_data$click, p=0.8, list=FALSE, times=1)
ad_train <- ad_data[trainIndex,]
ad_test <- ad_data[-trainIndex,]

I then fit a GLM on the ad_train data:

ad_glm_model <- glm(ad_train$clicks ~ ad_train$C1 + ad_train$site_category + ad_train$device_type, family = binomial(link = "logit"), data = ad_train)

But whenever I use the predict function to check how the model performs on the ad_test set, I get this warning:

test_model <- predict(ad_glm_model, newdata = ad_test, type = "response")
Warning message:
'newdata' had 20000 rows but variables found have 80000 rows 

What gives? How can I test my GLM model on new data?

EDIT: This works fine. I just needed to make this call instead:

ad_glm_model <- glm(clicks ~ C1 + site_category + device_type, family = binomial(link = "logit"), data = ad_train)

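This happens because you prefixed every variable in the model formula with the data frame name. The model is then tied to the columns of ad_train itself, so predict cannot substitute the columns of ad_test. Instead, your formula should be: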

glm(clicks ~ C1 + site_category + device_type, family = binomial(link = "logit"), data = ad_train)
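
To make the difference concrete, here is a minimal sketch that reproduces the warning and the fix. It assumes simulated data rather than the real Kaggle file: the column names mirror the question, but the values, the sample size, and the base-R split (used here in place of createDataPartition) are illustrative assumptions only.

```r
# Simulated stand-in for the ad data; values are random and illustrative.
set.seed(42)
n <- 1000
ad_data <- data.frame(
  clicks      = rbinom(n, size = 1, prob = 0.2),
  C1          = rnorm(n),
  device_type = factor(sample(c("0", "1", "4"), n, replace = TRUE))
)

# Simple 80/20 split with base R (assumption: replaces the caret call).
train_rows <- sample(n, size = round(0.8 * n))
ad_train <- ad_data[train_rows, ]
ad_test  <- ad_data[-train_rows, ]

# Prefixed formula: every term refers to ad_train directly.
bad_model <- glm(ad_train$clicks ~ ad_train$C1 + ad_train$device_type,
                 family = binomial(link = "logit"), data = ad_train)
# predict() evaluates ad_train$C1 again, ignores the columns of ad_test,
# and warns that the row counts differ (the warning is silenced here).
bad_pred <- suppressWarnings(
  predict(bad_model, newdata = ad_test, type = "response")
)
length(bad_pred)   # equals nrow(ad_train), not nrow(ad_test)

# Bare column names: predict() resolves them inside newdata.
good_model <- glm(clicks ~ C1 + device_type,
                  family = binomial(link = "logit"), data = ad_train)
good_pred <- predict(good_model, newdata = ad_test, type = "response")
length(good_pred)  # equals nrow(ad_test)

# Rough held-out check: share of test rows classified correctly at a 0.5 cutoff.
accuracy <- mean((good_pred > 0.5) == (ad_test$clicks == 1))
```

The 0.5 cutoff is only a placeholder. For a CTR task with few positive rows, a probability-based metric such as log loss is usually a better fit than accuracy.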

As the second link in the duplicate notice explains:

This is a problem of using different names between your data and your newdata and not a problem between using vectors or dataframes.

When you fit a model with the lm function and then use predict to make predictions, predict tries to find the same names on your newdata. In your first case name x conflicts with mtcars$wt and hence you get the warning.
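
One way to see which names a fitted model will look for is to inspect its term labels. The sketch below uses a made-up data frame called toy with placeholder columns y and x; none of these names come from the question.

```r
# Toy data; names and values are placeholders for illustration.
set.seed(1)
toy <- data.frame(y = rbinom(50, size = 1, prob = 0.5), x = rnorm(50))

prefixed <- glm(toy$y ~ toy$x, family = binomial, data = toy)
plain    <- glm(y ~ x, family = binomial, data = toy)

# Each term label is an expression that predict() re-evaluates for newdata.
prefixed_terms <- attr(terms(prefixed), "term.labels")  # "toy$x"
plain_terms    <- attr(terms(plain), "term.labels")     # "x"

new_rows <- data.frame(x = c(-1, 0, 1))

# "x" is a column of new_rows, so the plain model predicts for those 3 rows.
plain_pred <- predict(plain, newdata = new_rows, type = "response")

# toy$x still points at the original data frame, so the prefixed model
# returns one value per row of toy (with a warning, silenced here).
prefixed_pred <- suppressWarnings(
  predict(prefixed, newdata = new_rows, type = "response")
)
c(length(plain_pred), length(prefixed_pred))
```

If the term labels contain a `$`, the model was fitted against a specific data frame and will not pick up the columns of newdata.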