R Feature Selection with caret - Limit results plot to top 10 and also store full results into data frame

Question

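I am fairly new to R and am attempting feature selection for the first time. I followed an online tutorial that uses the PimaIndiansDiabetes dataset as its example, and repeated its steps on my own dataset, which has more than 110 features.

I have included the tutorial's example code below. The only difference is that my code uses a larger dataset and different naming conventions.

When I plot the importance values for my own results, more than 110 items appear in the plot. Does anyone know how I can limit it to the top 10?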

library(mlbench)
library(caret)
# ensure results are repeatable
set.seed(7)

# load the dataset
data(PimaIndiansDiabetes)

# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)

# train the model
model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq",
               preProcess="scale", trControl=control)

# estimate variable importance
importance <- varImp(model, scale=FALSE)

# summarize importance

print(importance)

# plot importance
plot(importance)

I would also like to be able to store the full results in a data frame. I tried the following command:

importanceDF <- as.data.frame(importance)

but I get the following error:

Error in as.data.frame.default(importance) : 
    cannot coerce class "varImp.train" to a data.frame

Apologies if this is a simple question; I have searched Google but have not found an answer that works.

Thanks in advance,

Amy

Edit:

Following zacdav's answer, I applied the following logic:

importance$importance
temp <- importance
temp$importance <- importance$importance[1:5, ]
plot(temp)

However, I noticed that when I run the original plot(importance), the order for the example data is as follows:

             Importance
glucose      0.7881
mass         0.6876
age          0.6869
pregnant     0.6195
pedigree     0.6062
pressure     0.5865
triceps      0.5536
insulin      0.5379

Then, when I run temp$importance <- importance$importance[1:5, ] followed by plot(temp), I get the following order:

glucose
pregnant
pressure
triceps
insulin

This takes the first 5 rows as they appear in the original table rather than the 5 most important ones.

I then tried running the following:

# put into DF
importanceDF <- importance$importance
# sort
importanceDF_Ordered <- importanceDF[order(-importanceDF$neg),] 
temp <- importanceDF_Ordered

The last line then raises an error:

Error in `$<-.data.frame`(`*tmp*`, "importance", value = list(neg = 
 c(0.619514925373134,  : 
  replacement has 5 rows, data has 8

I'm not sure how to fix this, so any help would be great.

Answer 1

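If you look at the structure of the importance object, you will see that it is a list of three elements: a data.frame holding the importance values for each response class, plus two pieces of metadata. You can pull out the data.frame with the $ notation.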

str(importance)

List of 3
 $ importance:'data.frame': 8 obs. of  2 variables:
  ..$ neg: num [1:8] 0.62 0.788 0.586 0.554 0.538 ...
  ..$ pos: num [1:8] 0.62 0.788 0.586 0.554 0.538 ...
 $ model     : chr "ROC curve"
 $ calledFrom: chr "varImp"
 - attr(*, "class")= chr "varImp.train"

So, to get the data.frame, all you need is importance$importance.
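If it helps, here is a minimal sketch of turning that data.frame into a sorted table with the variable names in their own column. To keep it runnable without caret, `imp` is a hand-built stand-in filled with the values printed in the question; with a real object you would use `imp <- importance$importance` instead. The column name `variable` is my own choice, not something caret provides.

```r
# Stand-in for importance$importance, using the values shown in the question.
# With a real caret object: imp <- importance$importance
vals <- c(0.6195, 0.7881, 0.5865, 0.5536, 0.5379, 0.6876, 0.6062, 0.6869)
imp <- data.frame(
  neg = vals,
  pos = vals,
  row.names = c("pregnant", "glucose", "pressure", "triceps",
                "insulin", "mass", "pedigree", "age")
)

# Move the row names into a regular column so they survive export or merges
importanceDF <- data.frame(variable = rownames(imp), imp, row.names = NULL)

# Sort by importance, largest first (neg and pos are identical here)
importanceDF <- importanceDF[order(importanceDF$neg, decreasing = TRUE), ]

print(importanceDF)
```

Sorting on `neg` is an assumption that suits this two-class case; with more classes you would have to decide which column (or which summary of the columns) defines the ranking.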

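As for plotting only a subset of the features, you can modify the object itself. I suggest working on a copy so that the analysis does not need to be re-run. A rough example follows: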

temp <- importance
temp$importance <- importance$importance[1:5, ]
plot(temp)

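Here I overwrite the data.frame in the temporary object with rows 1:5 of the original, so the plot shows the first five rows. Note that this selects rows by position, not by importance. If you are interested in the plot method that gets called, look at caret:::plot.varImp.train.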
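To pick the five most important rows rather than the first five, one option is to order the rows before subsetting. The sketch below uses a hand-built stand-in for the varImp object (values taken from the question) so that it runs without caret; with a real object, skip the `structure(...)` part. The error in the question's edit comes from `$<-.data.frame`, which suggests that `temp` was the sorted data.frame rather than a copy of the varImp object, so here `temp` is copied from `importance` itself.

```r
# Stand-in for the object returned by varImp(); values are from the question.
vals <- c(0.6195, 0.7881, 0.5865, 0.5536, 0.5379, 0.6876, 0.6062, 0.6869)
importance <- structure(
  list(
    importance = data.frame(
      neg = vals,
      pos = vals,
      row.names = c("pregnant", "glucose", "pressure", "triceps",
                    "insulin", "mass", "pedigree", "age")
    ),
    model = "ROC curve",
    calledFrom = "varImp"
  ),
  class = "varImp.train"
)

# Copy the whole varImp object, not just its data.frame
temp <- importance

# Row positions sorted by importance, largest first
ord <- order(importance$importance$neg, decreasing = TRUE)

# Keep the top five rows; drop = FALSE keeps a data.frame in all cases
temp$importance <- importance$importance[head(ord, 5), , drop = FALSE]

top_vars <- rownames(temp$importance)
print(top_vars)

# With caret loaded and a real varImp object:
# plot(temp)
```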

Answer 2

plot has a built-in argument for showing only the top x values:

plot(importance, top=10)
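If the same code may run on datasets with fewer than 10 variables, a cautious option is to clamp the value passed to top. I have not checked how the plot method behaves when top is larger than the number of variables, so treat this as a defensive sketch rather than a requirement. The list below is only a stand-in for a real varImp object, using the values from the question.

```r
# Stand-in holding just the importance table; with a real caret object
# use the object returned by varImp() directly.
vals <- c(0.6195, 0.7881, 0.5865, 0.5536, 0.5379, 0.6876, 0.6062, 0.6869)
importance <- list(
  importance = data.frame(
    neg = vals,
    pos = vals,
    row.names = c("pregnant", "glucose", "pressure", "triceps",
                  "insulin", "mass", "pedigree", "age")
  )
)

# Never ask for more variables than the table contains
n_top <- min(10, nrow(importance$importance))
print(n_top)

# With caret loaded and a real varImp object:
# plot(importance, top = n_top)
```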
