Regression in R after removing outliers

I have the following data.frame:

time        values    outlier  
20/01/2010   11         no          
20/02/2010   12         no
20/03/2010   11         no
20/04/2010   12         no
20/05/2010   10         no
20/06/2010   20         yes
20/07/2010   11         no
20/02/2010   12         no

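I would like to run a regression on this data frame, with values as the response (dependent variable) and time as the predictor (independent variable). However, I want to exclude every row that has "yes" in the outlier column.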

This is what I have tried:

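# Keep only the rows that are not flagged as outliers
temp <- subset(df, outlier == "no")
# Assumes df$time has already been converted to Date (or numeric)
fit  <- lm(values ~ time, data = temp)
slope   <- coef(fit)[[2]]
intrcpt <- coef(fit)[[1]]

# The fitted line is evaluated at time (the predictor), not at values
temp$regression_points <- as.numeric(temp$time) * slope + intrcpt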

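Now I would like to use the fitted regression model to predict the values in temp and put the results back into the original data frame, like this: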

time        values    outlier      regression_points  
20/01/2010   11         no                11
20/02/2010   12         no                11
20/03/2010   11         no                11
20/04/2010   12         no                11
20/05/2010   10         no                11
20/06/2010   20         yes               
20/07/2010   11         no                11
20/02/2010   12         no                11

How can I do this?

Please see the following code:

# Create example data
set.seed(1)
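# origin is given explicitly because as.Date() on numbers requires it in R versions before 4.3
df <- data.frame(time = as.Date(1:100, origin = "1970-01-01"), value = runif(100), outlier = sample(0:1, 100, TRUE))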

# Fit model for non-outliers
fit <- lm(value ~ time, df[df$outlier == 0, ] )

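# Predict for every row, then blank out the outliers.
# predict() with newdata returns one value per row of df, whereas
# fitted() only covers the rows used in the fit and ignores extra arguments.
df$regression_points <- ifelse(df$outlier == 1, NA, predict(fit, newdata = df))

# Shape of the result (exact numbers depend on the random draw and R version):
#     time     value    outlier regression_points
# 1 1970-01-02 0.2655087       1                NA
# 2 1970-01-03 0.3721239       0         0.5866995
# 3 1970-01-04 0.5728534       0         0.5834598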
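
One detail worth checking when filling a column for the full data frame is how many values each extractor returns. The sketch below uses a made-up data frame (`toy` and `fit_toy` are hypothetical names, and the numbers are chosen so the non-outliers lie exactly on a line) to show that `fitted()` returns one value per row used in the fit, while `predict()` with `newdata` returns one value per row of the data passed in.

```r
# Hypothetical toy data: 10 days, every third row flagged as an outlier
toy <- data.frame(
  time    = as.Date("2010-01-01") + 0:9,
  value   = c(1, 2, 30, 4, 5, 60, 7, 8, 90, 10),
  outlier = rep(c(0, 0, 1), length.out = 10)
)

# Fit on the non-outlier rows only
fit_toy <- lm(value ~ time, data = toy[toy$outlier == 0, ])

n_fitted    <- length(fitted(fit_toy))                  # 7: rows used in the fit
n_predicted <- length(predict(fit_toy, newdata = toy))  # 10: every row of toy

# Predictions line up with toy row by row, so outliers can be blanked safely
toy$regression_points <- ifelse(toy$outlier == 1, NA,
                                predict(fit_toy, newdata = toy))
print(toy)
```

Because the prediction vector has the same length and order as the data frame, blanking the outlier rows with `ifelse()` does not rely on recycling.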

Create a new data frame, df2, in which the values of the outlier rows are set to NA, and then fit the model with na.action = na.exclude:

df2 <- transform(df, values = ifelse(outlier == "no", values, NA))
fm <- lm(values ~ time, df2, na.action = na.exclude)
transform(df, fitted = fitted(fm))

giving:

        time values outlier   fitted
1 2010-01-20     11      no 11.64579
2 2010-02-20     12      no 11.49318
3 2010-03-20     11      no 11.35534
4 2010-04-20     12      no 11.20273
5 2010-05-20     10      no 11.05504
6 2010-06-20     20     yes       NA
7 2010-07-20     11      no 10.75474
8 2010-02-20     12      no 11.49318
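
The reason `na.exclude` matters here is that it makes `fitted()` pad the result with `NA` for the rows that were dropped, so the output has one entry per input row. A minimal sketch on made-up data (`toy`, `fm_omit`, and `fm_exclude` are hypothetical names) comparing it with `na.omit`:

```r
# Hypothetical toy data with one response value blanked out
toy <- data.frame(x = 1:6, y = c(2, 4, NA, 8, 10, 12))

fm_omit    <- lm(y ~ x, data = toy, na.action = na.omit)
fm_exclude <- lm(y ~ x, data = toy, na.action = na.exclude)

length(fitted(fm_omit))     # 5: the NA row is dropped from the result
length(fitted(fm_exclude))  # 6: the NA row is kept, filled with NA

# Both models are fitted to the same 5 rows, so the coefficients agree
all.equal(coef(fm_omit), coef(fm_exclude))
```

Only the shape of the extracted fitted values and residuals differs between the two; the estimated model is the same.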

Note: The input used, in reproducible form, is:

Lines <- 
"time        values    outlier  
20/01/2010   11         no          
20/02/2010   12         no
20/03/2010   11         no
20/04/2010   12         no
20/05/2010   10         no
20/06/2010   20         yes
20/07/2010   11         no
20/02/2010   12         no"

df <- read.table(text = Lines, header = TRUE)
df$time <- as.Date(df$time, format = "%d/%m/%Y")
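
# Alternative: fit on the non-outlier rows via the subset argument,
# then assign the fitted values back into exactly those rows
fit <- lm(values ~ time, subset = outlier == "no", data = df)
df$regression_points <- NA
df$regression_points[df$outlier == "no"] <- fitted(fit)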
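
As a sanity check, the na.exclude approach and the subset approach can be compared on the question's data. This is a minimal sketch; `via_exclude` and `via_subset` are names made up for illustration, and the data frame is rebuilt inline so the block runs on its own.

```r
# The question's data, with time parsed as Date
df <- data.frame(
  time    = as.Date(c("20/01/2010", "20/02/2010", "20/03/2010", "20/04/2010",
                      "20/05/2010", "20/06/2010", "20/07/2010", "20/02/2010"),
                    format = "%d/%m/%Y"),
  values  = c(11, 12, 11, 12, 10, 20, 11, 12),
  outlier = c("no", "no", "no", "no", "no", "yes", "no", "no")
)

# Approach 1: blank the outliers and fit with na.exclude
df2 <- transform(df, values = ifelse(outlier == "no", values, NA))
via_exclude <- unname(fitted(lm(values ~ time, df2, na.action = na.exclude)))

# Approach 2: fit on the subset, then assign into the matching rows
fit <- lm(values ~ time, data = df, subset = outlier == "no")
via_subset <- rep(NA_real_, nrow(df))
via_subset[df$outlier == "no"] <- fitted(fit)

# Same fitted values, with NA in the outlier row for both
all.equal(via_exclude, via_subset)  # TRUE
```

Both routes fit the same seven rows, so the choice between them is mostly a matter of whether you prefer to mark outliers as NA in the data or to filter them in the lm() call.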