Poor h2o GBM Classification Performance in a balanced binomial response

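In a fairly balanced binomial classification problem, I am seeing an unusually high error rate from h2o.gbm when identifying class 0, even on the training set itself. The competition is already over, so my only interest is in understanding what went wrong.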

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
            0      1    Error            Rate
0      147857 234035 0.612830  =234035/381892
1       44782 271661 0.141517   =44782/316443
Totals 192639 505696 0.399260  =278817/698335
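
For reference, the error rates in the matrix can be recomputed from the four cell counts. The sketch below uses only the numbers shown above; the variable names are illustrative. It also shows the share of actual class 0 rows (about 0.547, which is what "fairly balanced" refers to) next to the share of rows predicted as class 1 at this threshold (about 0.724).

```r
# Confusion matrix counts copied from the output above (rows: actual, columns: predicted)
cm <- matrix(c(147857, 234035,
                44782, 271661),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Per-class error: misclassified rows divided by the actual rows of that class
err0 <- cm["0", "1"] / sum(cm["0", ])
err1 <- cm["1", "0"] / sum(cm["1", ])

# Overall error: all off-diagonal cells divided by the total row count
total_err <- (cm["0", "1"] + cm["1", "0"]) / sum(cm)

# Actual share of class 0 versus share of rows predicted as class 1
share_actual_0 <- sum(cm["0", ]) / sum(cm)
share_pred_1   <- sum(cm[, "1"]) / sum(cm)

print(round(c(err0 = err0, err1 = err1, total = total_err,
              actual_0 = share_actual_0, predicted_1 = share_pred_1), 6))
```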

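Any expert advice on treating the data and reducing the error is welcome. I tried the following approaches and saw no reduction in error. Approach 1: selecting the top 5 important variables via h2o.varimp(gbm). Approach 2: converting negative normalized values to 0 and positive values to 1.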

    #Data Definition

# Variable                        Definition

#Independent Variables

# ID                                Unique ID for each observation
# Timestamp                       Unique value representing one day
# Stock_ID                        Unique ID representing one stock
# Volume                            Normalized values of volume traded of given stock ID on that timestamp
# Three_Day_Moving_Average        Normalized values of three days moving average of Closing price for given stock ID (Including Current day)
# Five_Day_Moving_Average           Normalized values of five days moving average of Closing price for given stock ID (Including Current day)
# Ten_Day_Moving_Average            Normalized values of ten days moving average of Closing price for given stock ID (Including Current day)
# Twenty_Day_Moving_Average       Normalized values of twenty days moving average of Closing price for given stock ID (Including Current day)
# True_Range                        Normalized values of true range for given stock ID
# Average_True_Range                Normalized values of average true range for given stock ID
# Positive_Directional_Movement   Normalized values of positive directional movement for given stock ID
# Negative_Directional_Movement   Normalized values of negative directional movement for given stock ID

#Dependent Response Variable
# Outcome                           Binary outcome variable representing whether the price of a particular stock at tomorrow's market close is higher (1) or lower (0) than the price at today's market close


temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/test_6lvBXoI.zip',temp)
test <- read.csv(unz(temp, "test.csv"))
unlink(temp)


temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/train_xup5Mf8.zip',temp)
#Please wait for 60 Mb file to load.
train <- read.csv(unz(temp, "train.csv"))
unlink(temp)

summary(train)

#We don't want the ID
train<-train[,2:ncol(train)]
# Preserving Test ID if needed
ID<-test$ID
#Remove ID from test
test<-test[,2:ncol(test)]
#Create empty response column Outcome
test$Outcome<-NA
#Original
combi.imp<-rbind(train,test)

rm(train,test)
summary(combi.imp)

#Creating Factor Variable
combi.imp$Outcome<-as.factor(combi.imp$Outcome)
combi.imp$Stock_ID<-as.factor(combi.imp$Stock_ID)
combi.imp$timestamp<-as.factor(combi.imp$timestamp)

summary(combi.imp)


#Brute Force NA treatment by taking only complete cases without NA.
train.complete<-combi.imp[1:702739,]
train.complete<-train.complete[complete.cases(train.complete),]
test.complete<-combi.imp[702740:804685,]

library(h2o)
y<-c("Outcome")
features=names(train.complete)[!names(train.complete) %in% c("Outcome")]
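#Shut down any running cluster first; wrapped in try() because h2o.shutdown() errors when no cluster is connected.
try(h2o.shutdown(prompt = FALSE), silent = TRUE)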
#Adjust memory size based on your system.
h2o.init(nthreads = -1,max_mem_size = "5g")

train.hex<-as.h2o(train.complete)
test.hex<-as.h2o(test.complete[,features])

#Models
gbmF_model_1 = h2o.gbm( x=features,
                        y = y,
                        training_frame =train.hex,
                        seed=1234
)
h2o.performance(gbmF_model_1)

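You have only trained a single GBM with default parameters, so it does not look like you have put enough effort into tuning your model. I suggest using the h2o.grid() function to run a random grid search over GBM. Here is an H2O R code example you can follow.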
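
As a rough illustration of that suggestion, here is a minimal sketch of a random grid search. It assumes the `train.hex` frame and the `features` and `y` objects from the question already exist in the session. The hyperparameter ranges, the 80/20 random split, the model budget, and the grid ID `gbm_random_grid` are all illustrative assumptions, not tuned recommendations.

```r
library(h2o)

# Assumption: h2o.init() has been called and train.hex, features and y
# are defined exactly as in the question's code.

# Hold out part of the training data so models are compared on rows they
# did not train on (a random 80/20 split is an assumption of this sketch).
splits     <- h2o.splitFrame(train.hex, ratios = 0.8, seed = 1234)
train_part <- splits[[1]]
valid_part <- splits[[2]]

# Candidate values to sample from (illustrative ranges only)
hyper_params <- list(
  max_depth       = c(3, 5, 7, 9),
  learn_rate      = c(0.01, 0.05, 0.1),
  sample_rate     = c(0.6, 0.8, 1.0),
  col_sample_rate = c(0.6, 0.8, 1.0)
)

# Random search: sample combinations instead of trying all 108 of them
search_criteria <- list(
  strategy   = "RandomDiscrete",
  max_models = 20,
  seed       = 1234
)

gbm_grid <- h2o.grid(
  algorithm           = "gbm",
  grid_id             = "gbm_random_grid",
  x                   = features,
  y                   = y,
  training_frame      = train_part,
  validation_frame    = valid_part,
  ntrees              = 500,   # upper bound; early stopping usually ends sooner
  stopping_rounds     = 5,
  stopping_metric     = "AUC",
  stopping_tolerance  = 1e-3,
  score_tree_interval = 10,
  seed                = 1234,
  hyper_params        = hyper_params,
  search_criteria     = search_criteria
)

# Rank the models by validation AUC and inspect the best one
sorted_grid <- h2o.getGrid("gbm_random_grid", sort_by = "auc", decreasing = TRUE)
best_gbm    <- h2o.getModel(sorted_grid@model_ids[[1]])
h2o.performance(best_gbm, valid = TRUE)
```

Note that `h2o.performance(gbmF_model_1)` in the question reports training metrics, whereas `valid = TRUE` here reports metrics on the held-out frame. On a data set of this size, 20 models may take a while; lowering `max_models` or adding `max_runtime_secs` to `search_criteria` limits the run time.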