How can I filter out rows from linear regression based on another linear regression

Question

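I would like to run a linear regression in three steps: 1) run the regression on all data points, 2) take out the 10 outliers with the largest absolute values of rstandard, 3) run the regression again on the new data frame. I know how to do this manually, but it is very awkward. Is there a way to do it automatically? Can the same be done for taking out columns?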

Here is my toy data frame and code (I will take out the 2 top outliers):

df <- read.table(text = "userid target birds    wolfs     
                 222       1        9         7 
                 444       1        8         4 
                 234       0        2         8 
                 543       1        2         3 
                 678       1        8         3 
                 987       0        1         2 
                 294       1        7         16 
                 608       0        1         5 
                 123       1        17        7 
                 321       1        8         7 
                 226       0        2         7 
                 556       0        20        3 
                 334       1        6         3 
                 225       0        1         1 
                 999       0        3         11 
                 987       0        30         1  ",header = TRUE) 
model <- lm(target ~ birds + wolfs, data = df)
rstandard <- abs(rstandard(model))
df <- cbind(df, rstandard)
g <- subset(df, rstandard > sort(unique(rstandard), decreasing = TRUE)[3])
g
   userid target birds wolfs rstandard
4     543      1     2     3  1.189858
13    334      1     6     3  1.122579

modelNew <- lm(target ~ birds + wolfs, data = df[-c(4, 13), ])

Answer 1

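I don't see how you could do this without estimating two models: the first to identify the most influential cases, and the second on the data without those cases. You can, however, simplify your code and avoid cluttering the workspace by doing it all in one shot, with the subsetting step embedded in the call that estimates the "final" model. Here is code that does this for the example you gave: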

model <- lm(target ~ birds + wolfs,
    data = df[-(as.numeric(names(sort(abs(rstandard(lm(target ~ birds + wolfs, data=df))), decreasing=TRUE)))[1:2]),])

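Here, the initial model, the influence assessment, and the subsequent subsetting of the data are all built into the code that follows the first data =. Note that this relies on the data frame having default row names (1, 2, 3, ...), because the names of the residuals are converted to numeric row positions.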
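If the nested call is hard to read, the same steps can be unrolled into a small helper. This is only a sketch: the function name refit_without_outliers and its arguments are made up for illustration and are not part of the original answer. It drops rows by row name rather than by numeric position, so it should also work when the row names are not 1, 2, 3, ...

```r
# Toy data from the question
df <- read.table(text = "userid target birds wolfs
222 1 9 7
444 1 8 4
234 0 2 8
543 1 2 3
678 1 8 3
987 0 1 2
294 1 7 16
608 0 1 5
123 1 17 7
321 1 8 7
226 0 2 7
556 0 20 3
334 1 6 3
225 0 1 1
999 0 3 11
987 0 30 1", header = TRUE)

# Hypothetical helper (name and arguments are assumptions for this sketch):
# fit once, drop the n rows with the largest absolute standardized residuals,
# then refit on the remaining rows.
refit_without_outliers <- function(formula, data, n = 2) {
  first_fit <- lm(formula, data = data)
  abs_res <- abs(rstandard(first_fit))
  # Row names of the n largest absolute standardized residuals
  drop_rows <- names(sort(abs_res, decreasing = TRUE))[seq_len(n)]
  kept <- data[!(rownames(data) %in% drop_rows), , drop = FALSE]
  list(dropped = drop_rows, data = kept, model = lm(formula, data = kept))
}

result <- refit_without_outliers(target ~ birds + wolfs, df, n = 2)
result$dropped       # row names that were removed
coef(result$model)   # coefficients of the refitted model
```

Changing n to 10 on the full data set would cover the case described in the question.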

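Also note that the resulting model will differ from the one your code produced. That is because your g does not correctly identify the two most influential cases, as you will see if you just look at the output of abs(rstandard(lm(target ~ birds + wolfs, data=df))). I think it has something to do with your use of unique(), which seems unnecessary, but I'm not sure.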
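One way to check this yourself is to sort the absolute standardized residuals and pick the top two by rank, rather than by a threshold built from unique(). The snippet below is a sketch that works on a fresh copy of the toy data, so no rstandard column is added to df; the variable names fit, abs_res and top2 are chosen here for illustration only.

```r
# Fresh copy of the toy data from the question
df <- read.table(text = "userid target birds wolfs
222 1 9 7
444 1 8 4
234 0 2 8
543 1 2 3
678 1 8 3
987 0 1 2
294 1 7 16
608 0 1 5
123 1 17 7
321 1 8 7
226 0 2 7
556 0 20 3
334 1 6 3
225 0 1 1
999 0 3 11
987 0 30 1", header = TRUE)

fit <- lm(target ~ birds + wolfs, data = df)
abs_res <- abs(rstandard(fit))

# All rows, largest absolute standardized residual first
sort(abs_res, decreasing = TRUE)

# Row names of the two largest values, selected by rank
top2 <- names(sort(abs_res, decreasing = TRUE))[1:2]
df[top2, ]
```

Comparing df[top2, ] with the rows printed in g shows whether the two approaches pick the same cases.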

r

linear-regression