使用 bootstrapping 估计斜率系数的 95% 置信区间
Use bootstrapping to estimate 95% confidence interval for slope coefficient
我有以下代码:
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 5 + 3*x + rnorm(n, mean = 0, sd = 0.3)
df <- data.frame(x=x,y=y)
slope <- lm(y~x, data = df)$coefficients[2]
我想使用 100 bootstrap 个样本来估计斜率系数的 95% 置信区间。我以前从未做过 bootstrapping,所以我有点卡住了。我知道我想使用 sample
从 1:n 中抽取 100 个索引进行替换。我也知道我需要斜率的标准误差来计算 CI。但是,当我执行 stderr<- sd(slope)/sqrt(length(slope))
时,我得到 <NA>
。只是不确定如何将其转换为 R。我应该将 CIs 设为 (2.95, 3.05)。
可以通过 sample(c(1, 2, 3, 4, 5), 5, TRUE)
对 c(1, 2, 3, 4, 5)
的单个 bootstrap 样本进行采样。许多可以使用 replicate
绘制:
# this was your original code
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 5 + 3*x + rnorm(n, mean = 0, sd = 0.3)
df <- data.frame(x=x,y=y)
# before we do many bootstrap samples, first define what happens in each one of them
one_regr <- function(){
mysample <- sample(1:n, n, TRUE) # which observations pairs are in this
# bootstrap sample?
my_df <- df[mysample, ]
slope <- lm(y ~x, data = my_df)$coefficients[2]
return(slope)
}
# test run -- works!
one_regr()
# now we can do many replicates with replicate()
many_slopes <- replicate(5000, one_regr())
hist(many_slopes, breaks = 30)
quantile(many_slopes, c(.025, .975))
加法1:
如果您不仅对斜率而且对截距以及
进行采样,您可以从中获得更多乐趣
one_regr <- function(){
mysample <- sample(1:n, n, TRUE) # which observation pairs are in this
# bootstrap sample?
my_df <- df[mysample, ]
return( lm(y ~x, data = my_df)$coefficients )
}
many_lines <- replicate(5000, one_regr())
intercepts <- many_lines[1,]
slopes <- many_lines[2,]
plot(df$x, df$y, main = "data and 500 regression lines")
for(i in 1:5000){ # just draw the regression lines
abline(a = intercepts[i], b = slopes[i], col = "#00009001")
}
plot(slopes, intercepts, main = "bootstrapped coefficients",
pch = 16, col = "#00009010")
quantile(slopes, c(.025, .975))
加法2:
https://www.investopedia.com/terms/s/standard-error.asp 状态:
The standard error (SE) of a statistic is the approximate standard deviation of a statistical sample population.
所以斜率的标准误差是我们 bootstrapped 斜率的标准差:
> sd(slopes)
[1] 0.02781162
粗略地说,这应该是 95%-CI 宽度的 4 倍:
> sd(slopes)
[1] 0.02781162
> abs(2.95 - 3.05)/4
[1] 0.025
我有以下代码:
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 5 + 3*x + rnorm(n, mean = 0, sd = 0.3)
df <- data.frame(x=x,y=y)
slope <- lm(y~x, data = df)$coefficients[2]
我想使用 100 bootstrap 个样本来估计斜率系数的 95% 置信区间。我以前从未做过 bootstrapping,所以我有点卡住了。我知道我想使用 sample
从 1:n 中抽取 100 个索引进行替换。我也知道我需要斜率的标准误差来计算 CI。但是,当我执行 stderr<- sd(slope)/sqrt(length(slope))
时,我得到 <NA>
。只是不确定如何将其转换为 R。我应该将 CIs 设为 (2.95, 3.05)。
可以通过 sample(c(1, 2, 3, 4, 5), 5, TRUE)
对 c(1, 2, 3, 4, 5)
的单个 bootstrap 样本进行采样。许多可以使用 replicate
绘制:
# this was your original code
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 5 + 3*x + rnorm(n, mean = 0, sd = 0.3)
df <- data.frame(x=x,y=y)
# before we do many bootstrap samples, first define what happens in each one of them
one_regr <- function(){
mysample <- sample(1:n, n, TRUE) # which observations pairs are in this
# bootstrap sample?
my_df <- df[mysample, ]
slope <- lm(y ~x, data = my_df)$coefficients[2]
return(slope)
}
# test run -- works!
one_regr()
# now we can do many replicates with replicate()
many_slopes <- replicate(5000, one_regr())
hist(many_slopes, breaks = 30)
quantile(many_slopes, c(.025, .975))
加法1:
如果您不仅对斜率而且对截距以及
进行采样,您可以从中获得更多乐趣one_regr <- function(){
mysample <- sample(1:n, n, TRUE) # which observation pairs are in this
# bootstrap sample?
my_df <- df[mysample, ]
return( lm(y ~x, data = my_df)$coefficients )
}
many_lines <- replicate(5000, one_regr())
intercepts <- many_lines[1,]
slopes <- many_lines[2,]
plot(df$x, df$y, main = "data and 500 regression lines")
for(i in 1:5000){ # just draw the regression lines
abline(a = intercepts[i], b = slopes[i], col = "#00009001")
}
plot(slopes, intercepts, main = "bootstrapped coefficients",
pch = 16, col = "#00009010")
quantile(slopes, c(.025, .975))
加法2:
https://www.investopedia.com/terms/s/standard-error.asp 状态:
The standard error (SE) of a statistic is the approximate standard deviation of a statistical sample population.
所以斜率的标准误差是我们 bootstrapped 斜率的标准差:
> sd(slopes)
[1] 0.02781162
粗略地说,这应该是 95%-CI 宽度的 4 倍:
> sd(slopes)
[1] 0.02781162
> abs(2.95 - 3.05)/4
[1] 0.025