Retrieving instances from random forest result

Question

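I am modifying the features in my random forest model, and I found that a large number of instances are being misclassified. How can I find the user IDs of those misclassified cases?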

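  library(party)  # provides cforest() and cforest_unbiased()

  # Binary response: is b equal to 'three'? Terms repeated in the original
  # formula (future, death, sad, ppron, quant) are listed once here.
  fit1 <- cforest((b == 'three') ~ affect + certain + negemo + future + swear + sad +
                    negate + ppron + sexual + death + filler + leisure + conj + funct + i +
                    past + bio + body + cause + cogmech + discrep + incl + motion +
                    quant + tentat + excl + insight + percept + posemo +
                    relativ + space + article,
                  data = trainset1,
                  controls = cforest_unbiased(ntree = 1000, mtry = 1))

  # Rows: OOB prediction > 0.5; columns: actual class is 'three'
  table1 <- table(predict(fit1, OOB = TRUE, type = 'response') > 0.5, trainset1$b == 'three')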

Result

        FALSE TRUE
 FALSE   213  200
 TRUE    821 1121

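The result shows that 821 cases from the other classes were misclassified as "three". How can I retrieve these 821 cases by userid so that I can compare their features? Thank you.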

Answer 1

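So you want to reuse some of the code that already built the table, and use it to pick out the rows that land in the table's bottom-left cell.

Here is the code that produces your table:

table(predict(fit1, OOB=TRUE, type = 'response') > 0.5, trainset1$b == 'three')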

If you run the first part on its own, you get a vector of all the predictions:

p<-predict(fit1, OOB=TRUE, type = 'response')

If you then apply the > 0.5 threshold, you get a vector of TRUE and FALSE values indicating whether each prediction is above the threshold or not:

tf<- p>0.5
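To make the thresholding step concrete, here is a minimal sketch on made-up numbers. The values in p are invented for illustration and stand in for the output of predict(fit1, ...); they are not from the model in the question.

```r
# Made-up predicted probabilities standing in for predict(fit1, ...)
p <- c(0.91, 0.32, 0.50, 0.77)

# Comparing a numeric vector with a threshold gives one logical per element.
# Note that > is strict, so a value of exactly 0.5 yields FALSE.
tf <- p > 0.5

print(tf)
```

Each element of tf lines up with the row of the training data at the same position, which is what makes the row selection in the next step possible.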

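Now, the last part gives another vector of TRUE and FALSE values, trainset1$b == "three". You want to know which rows were classified as "three" (which I take to be TRUE in tf, i.e. p > 0.5) but are not actually class "three" (FALSE for trainset1$b == "three" from the question). To get them, you need all rows where tf == TRUE AND trainset1$b != "three":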

newdata<- trainset1[tf==TRUE & trainset1$b!="three",]
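As a rough illustration of how this row selection leads to the user IDs asked about, here is a small sketch on a toy data frame. The data frame toy, its values, and the column name userid are all assumptions made for the example; substitute whatever ID column trainset1 actually has.

```r
# Hypothetical stand-in for trainset1; 'userid' is an assumed column name
toy <- data.frame(
  userid = c("u1", "u2", "u3", "u4", "u5"),
  b      = c("three", "one", "two", "three", "one"),
  stringsAsFactors = FALSE
)

# Made-up predicted probabilities, one per row of toy
p  <- c(0.8, 0.7, 0.2, 0.4, 0.9)
tf <- p > 0.5

# Rows predicted as "three" whose actual class is something else
wrong <- toy[tf & toy$b != "three", ]

# IDs of the misclassified cases
print(wrong$userid)
```

Writing tf instead of tf == TRUE selects the same rows, since tf is already logical. Because wrong keeps every column, the features of these cases can be compared directly as well.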

Just double-check that nrow(newdata) is 821.
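One way to do that check without hard-coding the number is to compare the row count against the matching cell of the table. The sketch below uses invented pred and actual vectors rather than the real model output.

```r
# Made-up logical vectors: predicted "three"? / actually "three"?
pred   <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
actual <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Same layout as the table in the question: rows = predicted, columns = actual
tab <- table(pred, actual)

# Bottom-left cell: predicted TRUE, actually FALSE
n_cell <- tab["TRUE", "FALSE"]

# Number of rows picked out by the same logical condition
n_rows <- sum(pred & !actual)

print(n_cell == n_rows)
```

With the real objects, the analogous comparison would be table1["TRUE", "FALSE"] against nrow(newdata).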

r

feature-selection

random-forest