Understanding Data Characteristics for Linear Regression in R - Plotting Data Distribution Over the Regression Line

Question

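I am trying to understand how to examine certain properties of the data when solving a regression problem. Specifically, I would like to see the distribution of the data (y) at a given value of the regressor (x) characterized as a normal distribution, and then plot that normal distribution (rotated 90 degrees) together with the data and the regression line.

Here is how I am approaching the problem so far (this code works fine):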

library(BAS)      # for the bodyfat data
library(ggplot2)  # for plotting

data(bodyfat)
dat <- data.frame(x = bodyfat$Abdomen, y = bodyfat$Bodyfat)

# Linear model
fat.mod <- lm(y ~ x, data = dat)

# Plot of linear model and data
g <- ggplot(bodyfat, aes(x = Abdomen, y = Bodyfat)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
g

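What I would like to see is a plot like the one in this image, but for x values that I can specify (perhaps with some spread in x?). Through the plot, I want to see the characteristics (mean and standard deviation or variance) of the overlaid distribution. It is fine to assume that the data are normally distributed about the regression line.

Where I really get stuck is when I specify a point that is not explicitly in the data (for example, the mean).

Any thoughts on this?

Many thanks!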

Answer 1

This is genuinely non-trivial. Here is a solution (apparently taken from Dr. Kurz's work, Doing Bayesian Data Analysis in brms and the tidyverse):

library(tidyverse)

# Draws per panel
n_draw <- 500

d <-
  data.frame(
    panel = rep(letters[1:2], each = n_draw),
    x = c(runif(n = n_draw, min = -10, max = 10),
          rnorm(n = n_draw / 2, mean = -7, sd = 2),
          rnorm(n = n_draw / 2, mean = 3, sd = 2))
  ) %>%
  mutate(y = 10 + 2 * x + rnorm(n = n(), mean = 0, sd = 2))

Create a separate data frame for the rotated Gaussians:

curves <-
  # Define the x values from which the normal curves come
  data.frame(x = seq(from = -7.5, to = 7.5, length.out = 4)) %>%
  # Use a linear relation (10 + 2x here) to compute an expected y for x
  mutate(y_mean = 10 + (2 * x)) %>%
  # Based on a normal distribution with mean `y_mean` and a standard deviation of 2,
  # compute the 95% intervals
  mutate(ll = qnorm(0.025, mean = y_mean, sd = 2),
         ul = qnorm(0.975, mean = y_mean, sd = 2)) %>%
  # Use the interval to make a series of y values
  mutate(y = map2(ll, ul, seq, length.out = 100)) %>%
  # This must be `unnest()`ed
  unnest(y) %>%
  # Calculate density values
  mutate(density = map2_dbl(y, y_mean, dnorm, sd = 2)) %>%
  # Rescale densities wider; redefine the x column
  mutate(x = x - density * 2 / max(density))
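The last step is what "rotates" each curve: the density height is turned into a horizontal offset to the left of the anchor x. As a minimal base-R sketch of that step for a single curve (the names `x0`, `dens` and `x_curve` are illustrative and not part of the answer; 101 grid points are used instead of 100 only so that the grid contains `y_mean` exactly):

```r
# One curve anchored at x0 = 2.5, using the answer's relation y = 10 + 2x and sd = 2
x0     <- 2.5
y_mean <- 10 + 2 * x0

# y grid over the central 95% of Normal(y_mean, 2)
y    <- seq(qnorm(0.025, y_mean, 2), qnorm(0.975, y_mean, 2), length.out = 101)
dens <- dnorm(y, mean = y_mean, sd = 2)

# Density height becomes a leftward offset of at most 2 x-units
x_curve <- x0 - 2 * dens / max(dens)

range(x_curve)         # the leftmost point (the peak) is at x0 - 2
y[which.min(x_curve)]  # the peak sits at y_mean, i.e. on the line y = 10 + 2x
```

Because the grid stops at the 95% bounds, the ends of the curve do not quite return to `x0`; widening the interval would bring them closer.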

Then plot:

g <- ggplot(d, aes(x = x, y = y)) +
  geom_point(size = 1/3, alpha = 1/3) +
  stat_smooth(method = "lm", se = FALSE, fullrange = TRUE,
              color = "red", linetype = 2) +
  # One path per curve; in ggplot2 >= 3.4.0, `linewidth = 1` replaces `size = 1` for lines
  geom_path(data = curves, aes(group = y_mean),
            size = 1, color = "blue") +
  coord_cartesian(xlim = c(-10, 10), ylim = c(-10, 30)) +
  theme(strip.background = element_blank(),
        strip.text = element_blank())
g

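The resulting plot shows the simulated points, the dashed red regression line, and four blue normal curves rotated 90 degrees and anchored along it.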
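The answer hard-codes the line (10 + 2x) and the standard deviation (2). To place curves at arbitrary x values for a fitted model, including values that do not occur in the data such as the mean of x, one option is to take the curve centers from `predict()` and the spread from the residual standard error `sigma()`. The following is only a sketch under those assumptions: `normal_curves()` and its arguments are hypothetical names, the data are simulated so the block is self-contained, and the predictor is assumed to be called `x`.

```r
# Hypothetical helper: rotated normal curves at chosen x values for a fitted lm
normal_curves <- function(fit, x_values, width = 2, n = 100) {
  s <- sigma(fit)  # residual standard error, used as the sd of every curve
  # Expected y on the fitted line; x_values need not occur in the data
  y_mean <- unname(predict(fit, newdata = data.frame(x = x_values)))
  pieces <- lapply(seq_along(x_values), function(i) {
    # y grid over the central 95% of Normal(y_mean, s)
    y <- seq(qnorm(0.025, y_mean[i], s), qnorm(0.975, y_mean[i], s),
             length.out = n)
    dens <- dnorm(y, mean = y_mean[i], sd = s)
    data.frame(x0 = x_values[i], y_mean = y_mean[i], sd = s,
               x = x_values[i] - width * dens / max(dens), y = y)
  })
  do.call(rbind, pieces)
}

# Simulated data standing in for a real data set
set.seed(1)
dat <- data.frame(x = runif(200, min = -10, max = 10))
dat$y <- 10 + 2 * dat$x + rnorm(200, sd = 2)
fit <- lm(y ~ x, data = dat)

# Curves at two fixed values and at the mean of x
x_at <- c(-5, mean(dat$x), 5)
curves_fit <- normal_curves(fit, x_at)

# Mean and standard deviation of each overlaid distribution
print(unique(curves_fit[, c("x0", "y_mean", "sd")]))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dat, aes(x = x, y = y)) +
    geom_point(alpha = 1/3) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    geom_path(data = curves_fit, aes(group = x0), color = "blue")
  print(p)
}
```

With the question's model, the same call would be `normal_curves(fat.mod, c(90, mean(dat$x)))`, since `fat.mod` was fitted as `y ~ x`; the value 90 is only an example, and `width` would need to be chosen to suit the scale of the x axis. This treats the fitted line and residual standard error as fixed and ignores uncertainty in the estimated coefficients.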


r

data-visualization

linear-regression