XGBoost model consistently obtaining 100% accuracy?
I'm using Airbnb's data, available here on Kaggle, to predict the countries users will book their first trips to, with an XGBoost model and almost 600 features in R. Running the algorithm through 50 rounds of 5-fold cross-validation, I obtained 100% accuracy each time. After fitting the model to the training data and predicting on a held-out test set, I also obtained 100% accuracy. These results can't be real. There must be something wrong with my code, but so far I haven't been able to figure it out. I've included a section of my code below. It's based on this article. Following along with the article (using the article's data and copying the code), I get results similar to the article's. However, when I apply the same code to Airbnb's data, I consistently obtain 100% accuracy. I have no idea what is going on. Am I using the xgboost package incorrectly? Thank you for your help and time.
# set up the data
# train is the data frame of features with the target variable to predict
full_variables <- data.matrix(train[,-1]) # country_destination removed
full_label <- as.numeric(train$country_destination) - 1
# training data
train_index <- caret::createDataPartition(y = train$country_destination, p = 0.70, list = FALSE)
train_data <- full_variables[train_index, ]
train_label <- full_label[train_index[,1]]
train_matrix <- xgb.DMatrix(data = train_data, label = train_label)
# test data
test_data <- full_variables[-train_index, ]
test_label <- full_label[-train_index[,1]]
test_matrix <- xgb.DMatrix(data = test_data, label = test_label)
# 5-fold CV
params <- list("objective" = "multi:softprob",
"num_class" = classes,
eta = 0.3,
max_depth = 6)
cv_model <- xgb.cv(params = params,
data = train_matrix,
nrounds = 50,
nfold = 5,
early_stop_round = 1,
verbose = F,
maximize = T,
prediction = T)
# out of fold predictions
out_of_fold_p <- data.frame(cv_model$pred) %>% mutate(max_prob = max.col(., ties.method = "last"),label = train_label + 1)
head(out_of_fold_p)
# confusion matrix
confusionMatrix(factor(out_of_fold_p$label),
factor(out_of_fold_p$max_prob),
mode = "everything")
A sample of the data I'm using can be obtained by running this code:
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/loshita/Senior_project/master/train.csv")
y <- read.csv(text = x)
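If you are using the train_users_2.csv.zip available on Kaggle, then the problem is that you are not removing country_destination from the training data set, because it is in column 16, not column 1.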
which(colnames(train) == "country_destination")
#output
16
Column 1 is id, which is unique for every observation and should be removed as well.
length(unique(train[,1])) == nrow(train)
#output
TRUE
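Dropping columns by name rather than by position guards against this kind of mix-up. The following is only a minimal sketch on a made-up data frame (toy_train and its values are invented for illustration; only the column names id and country_destination come from the Kaggle file):

```r
# Toy stand-in for the Kaggle file; the real data has many more columns.
toy_train <- data.frame(
  id = c("a1", "b2", "c3", "d4"),
  age = c(31, 45, 27, 52),
  signup_flow = c(0, 3, 0, 25),
  country_destination = factor(c("US", "NDF", "FR", "US"))
)

# Columns that must never reach the model: the row identifier and the target
drop_cols <- c("id", "country_destination")
feature_cols <- setdiff(colnames(toy_train), drop_cols)

full_variables <- data.matrix(toy_train[, feature_cols, drop = FALSE])
full_label <- as.numeric(toy_train$country_destination) - 1

# Fail loudly if the identifier or the target leaked into the feature matrix
stopifnot(!any(drop_cols %in% colnames(full_variables)))
```

With this pattern, reordering the columns of the input file does not silently change which columns are removed.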
When I run your code with the following modifications:
full_variables <- data.matrix(train[,-c(1, 16)])
library(xgboost)
params <- list("objective" = "multi:softprob",
"num_class" = length(unique(train_label)),
eta = 0.3,
max_depth = 6)
cv_model <- xgb.cv(params = params,
data = train_matrix,
nrounds = 50,
nfold = 5,
early_stop_round = 1,
verbose = T,
maximize = T,
prediction = T)
With the above settings I get a test error of 0.12 in cross-validation.
out_of_fold_p <- data.frame(cv_model$pred) %>% mutate(max_prob = max.col(., ties.method = "last"),label = train_label + 1)
head(out_of_fold_p[,13:14], 20)
#output
max_prob label
1 8 8
2 12 12
3 12 10
4 12 12
5 12 12
6 12 12
7 12 12
8 12 12
9 8 8
10 12 5
11 12 2
12 2 12
13 12 12
14 12 12
15 12 12
16 8 8
17 8 8
18 12 5
19 8 8
20 12 12
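As a quick sanity check, the misclassification rate can be computed directly from the two columns. This sketch uses only the 20 rows printed above, typed in by hand, so its result describes that small excerpt and is not expected to match the 0.12 error over all out-of-fold predictions:

```r
# The 20 rows printed above: predicted class and true class (both 1-based)
max_prob <- c(8, 12, 12, 12, 12, 12, 12, 12, 8, 12,
              12, 2, 12, 12, 12, 8, 8, 12, 8, 12)
label    <- c(8, 12, 10, 12, 12, 12, 12, 12, 8, 5,
              2, 12, 12, 12, 12, 8, 8, 5, 8, 12)

# Share of rows where the prediction disagrees with the label
error_rate <- mean(max_prob != label)

# Which classes get confused: rows are the truth, columns the predictions
confusion <- table(truth = label, predicted = max_prob)
print(confusion)
```

On the full data, the same two lines can be applied to out_of_fold_p$max_prob and out_of_fold_p$label.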
In summary, you did not remove y from x.
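EDIT: After downloading the real training set and playing around with it, I can say that the accuracy in 5-fold CV really is 100%. Moreover, this is achieved with only 22 features (and possibly even fewer).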
model <- xgboost(params = params,
data = train_matrix,
nrounds = 50,
verbose = T,
maximize = T)
This model also achieves 100% accuracy on the test set:
pred <- predict(model, test_matrix)
pred <- matrix(pred, ncol=length(unique(train_label)), byrow = TRUE)
out_of_fold_p <- data.frame(pred) %>% mutate(max_prob = max.col(., ties.method = "last"),label = test_label + 1)
sum(out_of_fold_p$max_prob != out_of_fold_p$label) #0 errors
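The reshaping step above is easy to get wrong, so here is a small sketch of what it does. It assumes, as the code above does, that predict() returns the class probabilities as one flat vector laid out row by row; the numbers and the names flat_pred and toy_label are made up for illustration:

```r
# Made-up flat probability vector for 3 rows and 4 classes,
# laid out row by row: all class probabilities for row 1, then row 2, ...
n_class <- 4
flat_pred <- c(0.1, 0.6, 0.2, 0.1,
               0.7, 0.1, 0.1, 0.1,
               0.2, 0.2, 0.1, 0.5)

# byrow = TRUE keeps each row's probabilities together in one matrix row
prob_matrix <- matrix(flat_pred, ncol = n_class, byrow = TRUE)

# Column index of the largest probability, i.e. the predicted class (1-based)
predicted_class <- max.col(prob_matrix, ties.method = "last")

# Zero-based labels, as passed to xgb.DMatrix, shifted by one to match
toy_label <- c(1, 0, 3)
n_errors <- sum(predicted_class != toy_label + 1)
```

Omitting byrow = TRUE would fill the matrix column by column and mix probabilities from different rows, which typically shows up as a much worse accuracy than the model really has.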
Now let's check which features are discriminative:
xgb.plot.importance(importance_matrix = xgb.importance(colnames(train_matrix), model))
Now, if you run xgb.cv with only these features:
train_matrix <- xgb.DMatrix(data = train_data[,which(colnames(train_data) %in% xgboost::xgb.importance(colnames(train_matrix), model)$Feature)], label = train_label)
set.seed(1)
cv_model <- xgb.cv(params = params,
data = train_matrix,
nrounds = 50,
nfold = 5,
early_stop_round = 1,
verbose = T,
maximize = T,
prediction = T)
you will also obtain 100% accuracy on the test folds.
Part of the reason for this is the huge class imbalance:
table(train_label)
train_label
0 1 2 3 4 5 6 7 8 9 10 11
3 10 12 13 36 16 19 856 7 73 3 451
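To put the imbalance in numbers, the counts above can be turned into class shares and a majority-class baseline. This is just arithmetic on the printed table, with the counts typed in by hand:

```r
# Class counts copied from the table(train_label) output above
class_counts <- c(`0` = 3, `1` = 10, `2` = 12, `3` = 13, `4` = 36, `5` = 16,
                  `6` = 19, `7` = 856, `8` = 7, `9` = 73, `10` = 3, `11` = 451)

class_share <- class_counts / sum(class_counts)

# Accuracy of always predicting the most frequent class
majority_class <- names(which.max(class_counts))
baseline_accuracy <- max(class_share)

print(round(sort(class_share, decreasing = TRUE), 3))
```

Always predicting class 7 would already be right for 856 of the 1499 rows (roughly 57%), and classes 7 and 11 together cover about 87% of the data, so accuracy on its own says little about the rare classes.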
In fact, the minority classes are very easily distinguished by a single dummy variable:
gg <- data.frame(train_data[,which(colnames(train_data) %in% xgb.importance(colnames(train_matrix), model)$Feature)], label = as.factor(train_label))
gg %>%
as_tibble() %>%
select(1:9, 11, 12, 15:21, 23) %>%
gather(key, value, 1:18) %>%
ggplot()+
geom_bar(aes(x = label))+
facet_grid(key ~ value) +
theme(strip.text.y = element_text(angle = 90))
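Based on the distribution of 0/1 values in the 22 most important features, it seems to me that any tree-based model would be able to achieve quite decent accuracy, if not 100%.
One would expect classes 0 and 10 to be problematic for 5-fold CV, since all of their subjects could fall into a single fold, so that the model would never see that class during training, at least in that instance. That would be a possibility if the CV folds were built by purely random sampling. This does not happen with xgb.cv: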
lapply(cv_model$folds, function(x){
table(train_label[x])})
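The per-fold tables show the rare classes spread across folds, which is what stratified fold assignment produces (xgb.cv has a stratified argument for this; I believe it is enabled by default for classification, but check the documentation for your version). The sketch below is not xgb.cv's implementation, just a hypothetical base-R helper, stratified_folds, on made-up labels to show the idea:

```r
# Hypothetical helper, not part of xgboost: assign fold ids class by class.
stratified_folds <- function(label, k) {
  fold_id <- integer(length(label))
  for (cls in unique(label)) {
    idx <- which(label == cls)
    # Cycle through a random ordering of the k folds within this class
    fold_id[idx] <- rep_len(sample.int(k), length(idx))
  }
  fold_id
}

set.seed(1)
# Made-up labels: one class with only 3 rows, two larger classes
toy_label <- c(rep(0, 3), rep(7, 40), rep(11, 20))
toy_folds <- stratified_folds(toy_label, k = 5)

# Rows are classes, columns are folds
fold_table <- table(toy_label, toy_folds)
print(fold_table)

# Number of held-out folds whose training part lacks some class
folds_missing_a_class <- sum(sapply(1:5, function(f) {
  length(unique(toy_label[toy_folds != f])) < length(unique(toy_label))
}))
```

Because the three rows of the rare class land in three different folds, every training split still contains at least two of them, whereas purely random assignment could put all three in the same held-out fold.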