as.h2o is creating 3 levels in my target variable instead of 2, so the model becomes multinomial instead of binomial. How do I prevent this?

Question

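I am using h2o.ai to create a binomial classification model, but when I use as.h2o to convert my dataset, it takes the column header of my target variable, which is "BUY", and adds it to the levels. So instead of 2 levels, 1 and 2, there are three levels: BUY, 1, and 2. That makes the model multinomial, which I do not want. How do I fix this?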

When I run perfH2o, this is the output:

H2OMultinomialMetrics: gbm

Test Set Metrics: 
=====================

MSE: (Extract with `h2o.mse`) 0.3260208
RMSE: (Extract with `h2o.rmse`) 0.5709823
Logloss: (Extract with `h2o.logloss`) 1.016186
 Mean Per-Class Error: 0.2755556
 R^2: (Extract with `h2o.r2`) -0.1913934
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
   BUY NO YES  Error      Rate
BUY      1  0   0 0.0000 =   0 / 1    #see here it is taking the header and thinking it is a level
NO       0 16   9 0.3600 =  9 / 25
YES      0  7   8 0.4667 =  7 / 15
Totals   1 23  17 0.3902 = 16 / 41

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
=======================================================================
Top-3 Hit Ratios: 
 k hit_ratio
1 1  0.609756
2 2  0.975610
3 3  1.000000

Here is my code:

#Getting packages
#install.packages("dplyr")
library(dplyr)
library(tidyverse)
library(tidyr)
#install.packages("tidyquant") #Used to quickly load the "tidyverse" (dplyr, tidyr, ggplot, etc)
#along with custom, business-report-friendly ggplot themes.
#Also great for time series analysis (not featured)
library(tidyquant)
#install.packages("unbalanced")
library(unbalanced) #contains various methods for working with unbalanced data. I will be using
#the ubSMOTE() function

#installing H20 latest stable release H20 is a professional machine learning package

# The following two commands remove any previously installed H2O packages for R.
#if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
#if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
#pkgs <- c("RCurl","jsonlite")
#for (pkg in pkgs) {
# if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
#}

# Now we download, install and initialize the H2O package for R.
#install.packages("h2o", type="source", repos="http://h2o-release.s3.amazonaws.com/h2o/rel-yule/2/R")

# Finally, let's load H2O and start up an H2O cluster
library(h2o)
h2o.init()

#Now getting the data
ngData <- read.csv(file.choose())

#Now I am going to create my Train, validation, and test set
splitPercentage1 <- .70
splitPercentage2 <- .5
numRows1 <- nrow(ngData)
sampleSize1 <- floor(splitPercentage1*numRows1)

set.seed(1)

idxTrain1 <- sample(1:numRows1, size = sampleSize1)
validationRaw <- ngData[-idxTrain1,]
trainRaw <- ngData[idxTrain1,]
#validation set created now time to make test set out of validation set

numRows2 <- nrow(validationRaw)
sampleSize2 <- floor(splitPercentage2*numRows2)
idxTrain2 <- sample(1:numRows2, size = sampleSize2)
testRaw <- validationRaw[-idxTrain2,]
validationRaw <- validationRaw[idxTrain2,]

#Now I have a randomly set train set, validation set, and test set

View(trainRaw)
View(testRaw)
View(validationRaw)

#all look good however we need our target variable "BUY" to be a factor not numeric
#also Buy = 1 Sell = 0 in the BUY column

trainRaw[,11] <- as.factor(trainRaw[,11])
testRaw[,11] <- as.factor(testRaw[,11])
validationRaw[,11] <- as.factor(validationRaw[,11])

View(trainRaw)
View(testRaw)
View(validationRaw)

#now to balance the data which i don't know if that is very necessary so I 
#will check how balanced it is

Buytable <- table(trainRaw$BUY)
Buydistr <- prop.table(Buytable)
Buydistr

 #very balanced with 52% sell and 47% buy so no need to balance

 h2o.no_progress()

#converting into h2o data frames
trainH20 <- as.h2o(trainRaw)
validH20 <- as.h2o(validationRaw)
testH20 <- as.h2o(testRaw)

#now to find a classification model

y <- "BUY"
x <- setdiff(names(trainH20), y)

automl_models_h2o <- h2o.automl(
  x = x, 
  y = y,
  training_frame    = trainH20,
  validation_frame  = validH20,
  leaderboard_frame = testH20,
  max_runtime_secs  = 60
 )

 #time to extract the leading model

 NGLeader <- automl_models_h2o@leader

 #making predictions using h2o.predict()

predH2o <- h2o.predict(NGLeader, newdata = testH20)
as_tibble(predH2o)

#now to check the performance
perfH2o <- h2o.performance(NGLeader, newdata = testH20)
perfH2o

 h2o.r2(perfH2o)
 #very bad r^2

 #turns out my model believes that BUY is one of the possible outcomes of Y so it is multinomial
 #I must fix that

 #######################################################################

Here is a glimpse() of my data:

Rows: 185

Columns: 11

$ ï..Month April, July, August, August, July, February, September, January, March, February, June, ...

$ East.Region -12, 24, 26, 21, 19, -43, 25, -43, -15, -9, 27, -28, 26, -27, 22, 23, 32, -54, 21, 12, ...

$ Midwest.Region -20, 20, 36, 29, 16, -47, 35, -38, -7, -4, 35, -31, 45, -27, 22, 29, 27, -56, 30, 14, -...

$ Mountain.Region -4, 6, 4, 3, 2, -6, 3, -10, 2, 0, 9, -2, 5, -9, 5, 3, 6, -6, 4, 2, -4, 5, 5, 3, -1, -7,...

$ Pacific.Region 5, 5, 2, 0, -1, -10, 5, -13, 9, -1, 11, -3, 0, -14, 7, 0, 9, -11, 0, -3, -8, 5, 5, 6, 0...

$ South.Central.Region 12, 3, 2, -2, -2, -41, 37, -15, 35, 21, 18, 1, 20, -10, 5, -6, 32, -38, 12, -14, -6, 17...

$ Salt 8, -5, -2, -5, -6, -19, 14, 13, 19, 5, -1, -1, 3, 15, -5, -3, 12, -8, 1, -13, -3, 3, -2...

$ Nonsalt 3, 7, 4, 4, 3, -22, 22, -28, 18, 16, 18, 3, 17, -25, 10, -4, 19, -29, 11, -2, -3, 15, 1...

$ Total.Lower.48 -19, 58, 69, 51, 34, -149, 105, -119, 23, 7, 98, -63, 96, -87, 61, 49, 106, -163, 67, 1...

$ Flow.Change -0.34, -0.06, 0.41, 3.64, -0.47, -0.10, 0.42, -0.51, -1.64, -1.08, -0.15, -0.27, 0.43, ...

$ BUY 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, ...

Answer 1

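The first option (better, but it means changing more lines of code) is not to load the data into R at all, but to have H2O load it, and also have H2O split it. The H2O loader recognizes the first row as a header row and treats it as column names, not data.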
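A minimal sketch of that first option, assuming the data sits in a CSV file that the H2O cluster can read. The function name `load_and_split`, its arguments, and the file name in the usage comment are hypothetical; the 70/15/15 ratios mirror the split in the question. Note that `h2o.splitFrame` splits approximately, so the row counts will not be exact.

```r
# Sketch only: the h2o:: calls need the h2o package and a running cluster.
load_and_split <- function(path, target = "BUY", seed = 1) {
  # Let H2O parse the CSV itself; its parser detects the header row.
  frame <- h2o::h2o.importFile(path)
  # Make the target categorical so H2O builds a classifier.
  frame[[target]] <- h2o::h2o.asfactor(frame[[target]])
  # Roughly 70% train, 15% validation, and the remaining 15% test.
  parts <- h2o::h2o.splitFrame(frame, ratios = c(0.70, 0.15), seed = seed)
  list(train = parts[[1]], valid = parts[[2]], test = parts[[3]])
}

# Usage (hypothetical file name):
# h2o::h2o.init()
# sets <- load_and_split("ngData.csv")
# h2o::h2o.levels(sets$train[["BUY"]])   # inspect the target's levels
```

If the levels printed at the end still include "BUY", the string is in the data itself rather than being introduced by the conversion.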

The second approach is to strip the header row in your R code.
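To see what the `header` argument of `read.csv()` does to the target column, here is a small base-R illustration. The inlined two-column CSV is made up for the example and only borrows column names from the question.

```r
# Hypothetical two-column CSV, inlined so the example is self-contained.
csv_text <- "Flow.Change,BUY\n-0.34,0\n0.41,1\n3.64,1\n-0.47,0"

# header = TRUE: the first row becomes the column names.
with_header <- read.csv(text = csv_text, header = TRUE)
levels(as.factor(with_header$BUY))      # only the two real classes

# header = FALSE: the first row is treated as data, so "BUY" becomes a value.
without_header <- read.csv(text = csv_text, header = FALSE)
levels(as.factor(without_header$V2))    # the two classes plus "BUY"
```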

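However, that is already the default behaviour of read.csv() (the header argument defaults to TRUE). So your data must contain "BUY" somewhere other than the first row. In that case, either fix the data by hand, or find and remove the bad row after loading the data into R.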
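Finding and removing such a row could look roughly like this in base R. The inlined CSV with a repeated header line is invented for illustration, and the valid target values are assumed to be 0 and 1 as in the glimpse() output; if your file uses YES/NO instead, change `valid_targets` accordingly.

```r
# Hypothetical data with a stray header row repeated in the middle of the file.
csv_text <- "Flow.Change,BUY\n-0.34,0\n0.41,1\nFlow.Change,BUY\n3.64,1\n-0.47,0"
ngData <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect the distinct target values; anything other than 0/1 is suspect.
table(ngData$BUY, useNA = "ifany")

# Flag rows whose target is not one of the expected values (assumed 0/1).
valid_targets <- c("0", "1")
bad_rows <- !(as.character(ngData$BUY) %in% valid_targets)
ngData[bad_rows, ]                      # look at the offending rows first
ngClean <- ngData[!bad_rows, ]

# The stray text forced the columns to character; convert them back.
ngClean$Flow.Change <- as.numeric(ngClean$Flow.Change)
ngClean$BUY <- factor(ngClean$BUY, levels = valid_targets)
levels(ngClean$BUY)                     # check before calling as.h2o()
```

Passing explicit `levels` to `factor()` also guards against a leftover unused level surviving the row removal.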

(If you disagree, could you post a sample data file that demonstrates the problem with the code you have given?)


r

h2o