Issue with the h2o package in R: subsetted data frames lead to near-perfect prediction accuracy

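I have been stumped by this problem for a long time and cannot figure it out. I believe it stems from subsets of data.frame objects retaining information from the parent object, and I suspect this causes problems when I train an h2o.deeplearning model on what I think is only my training set (though this may not be true). See the sample code below. I have included comments to clarify what I am doing, but the code is fairly short: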

dataset = read.csv("dataset.csv")[,-1] # Read dataset in but omit the first column (it's just an index from the original data)
y = dataset[,1] # Create response
X = dataset[,-1] # Create regressors

X = model.matrix(y~.,data=dataset) # Automatically create dummy variables
y=as.factor(y) # Ensure y has factor data type
dataset = data.frame(y,X) # Create final data.frame dataset

train = sample(length(y),length(y)/1.66) # Create training indices (integer row numbers)
test = (-train) # Create testing indices

h2o.init(nthreads=2) # Initiate h2o

# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(y='y',training_frame=as.h2o(dataset[train,,drop=TRUE]),activation="Rectifier",
                           hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)


predictions = h2o.predict(mlModel,newdata=as.h2o(dataset[test,-1])) # Predict using mlModel
predictions = as.data.frame(predictions) # Convert predictions to dataframe object. as.vector() caused issues for me
predictions = predictions[,1] # Extract predictions

mean(predictions!=y[test]) 

The problem is that when I evaluate the model on my test subset, I get an error rate of almost 0%:

[1] 0.0007531255

Has anyone else run into this? Any idea how to fix it?

It would be more efficient to use H2O's own functions to load the data and split it:

library(h2o)
h2o.init(nthreads=2) # Start (or connect to) a local H2O cluster

data = h2o.importFile("dataset.csv")
y = 2 #Response is 2nd column, first is an index
x = 3:(ncol(data))  #Learn from all the other columns
data[,y] = as.factor(data[,y])

parts = h2o.splitFrame(data, 0.8)  #Split 80/20
train = parts[[1]]
test = parts[[2]]

# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(x=x, y=y, training_frame=train,activation="Rectifier",
                           hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)

h2o.performance(mlModel, test) # Evaluate on the held-out test frame

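Without seeing the contents of dataset.csv and being able to try it, it is hard to say what is wrong with your original code. My guess is that train and test are not actually being split, so the model is effectively being trained on the test data.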
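That guess can be tested in base R without H2O. The sketch below runs on a made-up data frame (the column names `label`, `a` and `b` are hypothetical, not taken from dataset.csv) and performs two checks. The first confirms whether the question's `train` and `-train` indices select overlapping rows. The second is an additional possibility that the question does not confirm: because `y` is not a column of `dataset`, the `.` in `y ~ .` may expand to every column, including the response itself, which would hand the model the answer as a predictor.

```r
# Base R only; the toy data below is invented purely for illustration.
set.seed(42)
n <- 100
dataset <- data.frame(
  label = rep(c(0, 1), length.out = n),  # hypothetical response column
  a     = rnorm(n),
  b     = rnorm(n)
)
y <- dataset[, 1]

# Check 1: do the train and test subsets share any rows?
train <- sample(length(y), floor(length(y) / 1.66))  # integer row indices
test  <- -train                                      # negative indices drop the training rows

train_rows <- rownames(dataset[train, ])
test_rows  <- rownames(dataset[test, ])
overlap    <- intersect(train_rows, test_rows)       # rows present in both subsets
is_partition <- length(train_rows) + length(test_rows) == n

# Check 2: does the design matrix built as in the question contain the response?
X <- model.matrix(y ~ ., data = dataset)
leaked <- "label" %in% colnames(X)

# One possible alternative (an assumption, not the asker's code):
# build the design matrix from the predictor columns only.
X_clean <- model.matrix(~ ., data = dataset[, -1, drop = FALSE])
still_leaked <- "label" %in% colnames(X_clean)

print(length(overlap))
print(is_partition)
print(colnames(X))
print(colnames(X_clean))
```

If `overlap` is empty on the real data, the split itself is fine and the response-in-predictors check becomes the more interesting one. Either way, comparing `colnames(X)` against the name of the response column in dataset.csv is a cheap thing to rule out before looking at H2O.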