How to test a logistic regression model in R?
I'm building a CTR prediction model for a Kaggle competition (link). I read the first 100,000 rows of the training set and then split them further into 80/20 train/test sets with:
library(caret)  # provides createDataPartition()
ad_data <- read.csv("train", header = TRUE, stringsAsFactors = FALSE, nrows = 100000)
trainIndex <- createDataPartition(ad_data$click, p=0.8, list=FALSE, times=1)
ad_train <- ad_data[trainIndex,]
ad_test <- ad_data[-trainIndex,]
I then fit a GLM on the ad_train data:
ad_glm_model <- glm(ad_train$clicks ~ ad_train$C1 + ad_train$site_category + ad_train$device_type, family = binomial(link = "logit"), data = ad_train)
But whenever I use the predict function to check how well the model does on the ad_test set, I get the following warning:
test_model <- predict(ad_glm_model, newdata = ad_test, type = "response")
Warning message:
'newdata' had 20000 rows but variables found have 80000 rows
What gives? How can I test my GLM on new data?
EDIT: It works fine now. I just needed to make this call instead:
ad_glm_model <- glm(clicks ~ C1 + site_category + device_type, family = binomial(link = "logit"), data = ad_train)
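This happens because you prefixed every variable in the model formula with the data frame name. The terms are then stored as ad_train$C1 and so on, so predict cannot match them to columns of newdata and falls back to the 80,000-row training vectors. Instead, your formula should be: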
glm(clicks ~ C1 + site_category + device_type, family = binomial(link = "logit"), data = ad_train)
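The difference between the two formulas can be reproduced on a small scale. The sketch below is only an illustration, not the asker's data: it assumes the built-in mtcars data set, an arbitrary split into the first 24 and last 8 rows, and made-up object names (train, test, bad, good).

```r
# Arbitrary split of the built-in mtcars data: 24 training rows, 8 test rows.
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

# Prefixed formula: the model term is literally "train$wt".
bad  <- glm(train$am ~ train$wt, family = binomial(link = "logit"), data = train)
# Bare column names: predict() can look up "wt" inside newdata.
good <- glm(am ~ wt, family = binomial(link = "logit"), data = train)

# Capture the warning text raised when predicting with the prefixed model.
msg <- tryCatch(predict(bad, newdata = test, type = "response"),
                warning = function(w) conditionMessage(w))

p_bad  <- suppressWarnings(predict(bad, newdata = test, type = "response"))
p_good <- predict(good, newdata = test, type = "response")

length(p_bad)   # one value per training row, because newdata was not used
length(p_good)  # one value per test row
```

With the prefixed formula, the result has as many elements as the training set, which is exactly the mismatch the warning reports.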
As stated in the second link of the duplicate notice:
This is a problem of using different names between your data and your
newdata and not a problem between using vectors or dataframes.
When you fit a model with the lm function and then use predict to make
predictions, predict tries to find the same names on your newdata. In
your first case name x conflicts with mtcars$wt and hence you get the
warning.
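Once predict returns one probability per test row, the model can be scored against the held-out labels. The following is a minimal sketch under several assumptions: the data are simulated (the column names mirror the question, but the values are invented), the split uses base R sample rather than caret, and log loss plus a 0.5-cutoff confusion table are just one reasonable choice of metrics.

```r
set.seed(42)
n <- 5000

# Simulated stand-in for the Kaggle file; all values are invented.
ad_data <- data.frame(
  C1            = sample(c("1005", "1002", "1010"), n, replace = TRUE),
  site_category = sample(c("a", "b", "c", "d"), n, replace = TRUE),
  device_type   = sample(c("0", "1", "4"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
ad_data$clicks <- rbinom(n, 1, ifelse(ad_data$device_type == "1", 0.25, 0.10))

# 80/20 split with base R (the question uses caret::createDataPartition).
train_rows <- sample(n, size = floor(0.8 * n))
ad_train <- ad_data[train_rows, ]
ad_test  <- ad_data[-train_rows, ]

ad_glm_model <- glm(clicks ~ C1 + site_category + device_type,
                    family = binomial(link = "logit"), data = ad_train)

# One predicted click probability per test row.
p <- predict(ad_glm_model, newdata = ad_test, type = "response")
y <- ad_test$clicks

# Log loss (lower is better); probabilities are clipped to avoid log(0).
eps      <- 1e-15
p_clip   <- pmin(pmax(p, eps), 1 - eps)
log_loss <- -mean(y * log(p_clip) + (1 - y) * log(1 - p_clip))

# Baseline for comparison: predict the training click rate for every row.
base_rate <- mean(ad_train$clicks)
baseline  <- -mean(y * log(base_rate) + (1 - y) * log(1 - base_rate))

# Confusion table at a 0.5 cutoff.
conf_tab <- table(predicted = as.integer(p > 0.5), actual = y)
```

Comparing log_loss with baseline shows whether the predictors add anything beyond the overall click rate. On real data, a category that appears only in the test rows makes predict fail, so factor levels may need aligning first.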