
Poor Accuracy Prediction with random forest in R

I'm new to R and I'm trying to predict the type of customer (member or normal customer in a store) from different variables (gender, total spent, rating, etc.) using the information on 1000 customers in my data frame. I built a model with random forest, but the accuracy is around 49% (OOB error rate). I tried using importance(RFM) to get higher accuracy by excluding the irrelevant variables, but I ended up with roughly 51% accuracy... Does this mean there is no relationship between the features, or is there a way to tune the model to get higher accuracy? Thanks a lot.

#Load the randomForest package
library(randomForest)

#Creating an index that randomly splits the rows into training (70%) and testing (30%) samples
index = sample(2, nrow(df), replace = TRUE, prob = c(0.7, 0.3))

#Training data
training = df[index==1,]

#Testing data
testing = df[index==2,]

#Random forest model 
RFM = randomForest(as.factor(Customer_type)~., data = training, ntree = 500, do.trace=T)
importance(RFM)

# Evaluating Model Accuracy
customertype_pred = predict(RFM, testing)
testing$customertype_pred = customertype_pred
View(testing)

#Building confusion Matrix to compare
CFM = table(testing$Customer_type, testing$customertype_pred)
CFM
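
For reference, the overall accuracy can be read straight off that table (a minimal sketch, assuming CFM is the confusion matrix built above):

#correctly classified cases divided by all cases
accuracy = sum(diag(CFM)) / sum(CFM)
accuracy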

Without your data or a reproducible example it is hard to actually improve your model, but I can point you to some procedures and packages that help a lot with this kind of task. Take a look at the caret package, which is designed specifically for model tuning. The package is really well documented, with lots of useful examples. Here is a general workflow using caret:
#load library and the data for this example
library(caret)
#this is a caret built-in dataset
data(GermanCredit)
df <- GermanCredit[,1:10]
str(GermanCredit)
#caret offers a useful function for data splitting. Here we split the data according to
#the class column (the outcome to be predicted), into 80% training and 20% testing data
ind <- createDataPartition(df$Class,p=0.8,list = F)
training <- df[ind,]
test <- df[-ind,]
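
One advantage of createDataPartition over a plain sample() is that the split is stratified on the outcome. A quick check (a small sketch using base R on the objects created above):

#class proportions should be roughly the same in both splits
prop.table(table(training$Class))
prop.table(table(test$Class))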

#here we set the resampling method for hyperparameter tuning
#in this case we choose 10-fold cross-validation
cn <- trainControl(method = "cv",number = 10)
#the grid of hyperparameters with which to tune the model
grid <- expand.grid(mtry=2:(ncol(training)-1))

#here is the proper model fitting. We fit a random forests model (method="rf") using 
#Class as outcome and all other variables as predictors, using the selected resampling 
#method and tuning grid
fit <- train(
  Class ~ .,
  data = training,
  method = "rf",
  trControl = cn,
  tuneGrid = grid
)

The output of the model looks like this:

Random Forest 

800 samples
9 predictor
2 classes: 'Bad', 'Good' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 
Resampling results across tuning parameters:

mtry  Accuracy  Kappa    
2     0.71125   0.1511164
3     0.70875   0.1937589
4     0.70000   0.1790469
5     0.70000   0.1819945
6     0.70375   0.1942889
7     0.70250   0.1955456
8     0.70625   0.2025015
9     0.69750   0.1887295

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
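
The same information is stored on the fitted object, so you can read it programmatically (results and bestTune are standard fields of a caret train object):

#accuracy and kappa for every mtry value tried
fit$results
#the winning hyperparameter setting
fit$bestTune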

As you can see, the train function builds one randomForest model for each value of the tuning parameter (in this case only mtry) and selects the one with the highest accuracy. The final parameter setting is then used to fit the final model on all the data supplied to train (here, every observation in the training data.frame). The output shows the resampling performance, which is usually optimistic. To test the accuracy of the model on the test set we can do:

#predict the output on the test set.
p <- predict(fit,test[,-10])
#this function builds a confusion matrix and calculates a lot of accuracy statistics
confusionMatrix(p,test$Class)
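
If you only need the overall test-set accuracy, confusionMatrix returns it in its overall element (a small sketch on the objects above):

cm <- confusionMatrix(p, test$Class)
cm$overall["Accuracy"]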

You can pass parameters that are specific to the selected model (randomForest in this case) to train through its ... argument, like this:

fit <- train(
  Class ~ .,
  data = training,
  method = "rf",
  trControl = cn,
  tuneGrid = grid,
  ntree = 200 #grow 200 trees (passed on to randomForest)
)
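
Since random forests already produce an out-of-bag error estimate, caret can also tune on that instead of cross-validation, which is much faster (a sketch under the same setup; cn_oob and fit_oob are just illustrative names):

#tune mtry on the OOB estimate instead of 10-fold CV
cn_oob <- trainControl(method = "oob")
fit_oob <- train(
  Class ~ .,
  data = training,
  method = "rf",
  trControl = cn_oob,
  tuneGrid = grid
)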

For finding the best set of variables (also known as variable selection or feature selection), caret has a lot of useful functions. The package has an entire vignette section on variable selection, covering simple filters, backwards selection, recursive feature elimination, genetic algorithms, simulated annealing, and of course the built-in feature selection methods of many models (such as randomForest's variable importance). Feature selection is a big topic, though; I suggest starting with the methods in the caret package and digging deeper only if you don't find what you are looking for.
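
As a starting point, here is a minimal recursive feature elimination sketch with caret's rfe and its built-in random forest helper functions (rfFuncs); the sizes vector and the column index of Class are illustrative:

#RFE with random forest importance, evaluated by 10-fold CV
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
rfe_fit <- rfe(
  x = training[, -10], #predictors (Class is column 10 in this data)
  y = training$Class,  #outcome
  sizes = 2:9,         #candidate subset sizes to evaluate
  rfeControl = ctrl
)
#the variable set with the best resampled performance
rfe_fit$optVariables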