Understanding Data Characteristics for Linear Regression in R - Plotting Data Distribution Over the Regression Line
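I'm trying to understand how to reason about certain properties of the data when solving a regression problem. Specifically, I'd like to see the distribution of the data (y) at a given value of the regressor (x), characterized as a normal distribution, and then plot that normal distribution (rotated 90 degrees) together with the data and the regression line.
Here is how I'm approaching the problem so far (this code works fine):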
library(BAS)      # provides the bodyfat data
library(ggplot2)  # needed for the plot below
x <- bodyfat$Abdomen
y <- bodyfat$Bodyfat
dat <- data.frame(x = x, y = y)
# Linear model
fat.mod <- lm(y ~ x, data = dat)
# Plot of linear model and data
g <- ggplot(bodyfat, aes(x = Abdomen, y = Bodyfat)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
g
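What I'd like to see is a plot of this kind (the data, the regression line, and a rotated normal curve), but for x values that I can specify (perhaps with some distribution over x?). From the plot, I want to read off the characteristics of the overlaid distribution (mean and standard deviation or variance). It is fine to assume that the data are normally distributed about the regression line.
Where I really get stuck is when I specify a point that does not occur explicitly in the data (for example, the mean).
Any thoughts on this?
Many thanks!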
This is really important. Here is a solution (apparently taken from Dr. Kurz's work, Doing Bayesian Data Analysis in brms and the tidyverse):
library(tidyverse)
# Draws per panel
n_draw <- 500
d <-
  data.frame(panel = rep(letters[1:2], each = n_draw),
             x = c(runif(n = n_draw, min = -10, max = 10),
                   rnorm(n = n_draw / 2, mean = -7, sd = 2),
                   rnorm(n = n_draw / 2, mean = 3, sd = 2))) %>%
  mutate(y = 10 + 2 * x + rnorm(n = n(), mean = 0, sd = 2))
Create a separate data frame for the rotated Gaussians:
# Define the x values from which the normal curves come
curves <- data.frame(x = seq(from = -7.5, to = 7.5, length.out = 4)) %>%
  # Use a linear relation (10 + 2x here) to compute an expected y for x
  mutate(y_mean = 10 + (2 * x)) %>%
  # Based on a normal distribution with mean `y_mean` and a standard
  # deviation of 2, compute the 95% intervals
  mutate(ll = qnorm(0.025, mean = y_mean, sd = 2),
         ul = qnorm(0.975, mean = y_mean, sd = 2)) %>%
  # Use the interval to make a series of y values
  mutate(y = map2(ll, ul, seq, length.out = 100)) %>%
  # This must be `unnest()`ed
  unnest(y) %>%
  # Calculate density values
  mutate(density = map2_dbl(y, y_mean, dnorm, sd = 2)) %>%
  # Rescale densities wider; redefine the x column
  mutate(x = x - density * 2 / max(density))
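To see what the pipeline above does for a single curve, here is a minimal base R sketch of the same construction. It takes one of the four x values (2.5) and the simulated relation and standard deviation from the code above; the names `x0`, `sd_y`, `y_grid`, `dens` and `x_curve` are made up for illustration.

```r
# One rotated curve, base R only
x0     <- 2.5            # one of the values in seq(-7.5, 7.5, length.out = 4)
y_mean <- 10 + 2 * x0    # expected y at x0 under the simulated relation
sd_y   <- 2              # standard deviation used in the simulation

# The y grid spans the central 95% of the conditional normal
y_grid <- seq(qnorm(0.025, mean = y_mean, sd = sd_y),
              qnorm(0.975, mean = y_mean, sd = sd_y),
              length.out = 100)
dens <- dnorm(y_grid, mean = y_mean, sd = sd_y)

# Rotation: the density becomes a horizontal offset to the left of x0,
# rescaled so that the highest point of the curve sits 2 x-units away
x_curve <- x0 - 2 * dens / max(dens)

range(y_grid) - y_mean   # roughly -3.92 and 3.92, i.e. -/+ 1.96 * sd_y
x0 - min(x_curve)        # the peak offset, 2 by construction
```

Plotting `x_curve` against `y_grid` as a path gives the bell shape lying on its side, with its base on the vertical line at `x0`.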
Then the plot:
g <- ggplot(d, aes(x = x, y = y)) +
  geom_point(size = 1/3, alpha = 1/3) +
  stat_smooth(method = "lm", se = FALSE, fullrange = TRUE,
              color = "red", linetype = 2) +
  # `linewidth` replaces `size` for lines as of ggplot2 3.4.0
  geom_path(data = curves, aes(group = y_mean),
            linewidth = 1, color = "blue") +
  coord_cartesian(xlim = c(-10, 10), ylim = c(-10, 30)) +
  theme(strip.background = element_blank(),
        strip.text = element_blank())
g
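This produces the scatter plot with the dashed red regression line and four blue rotated normal curves drawn over it.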
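The same idea can be carried over to the bodyfat data from the question. The following is only a sketch, under the assumption stated in the question that the data are normally distributed about the regression line with constant spread: the mean of each curve comes from `predict()` on the fitted model and the standard deviation from `sigma()`. The helper `rotated_curve()`, the chosen x values in `x0` and the `width` scaling are hypothetical, not part of the original answer. Because `predict()` evaluates the fitted line, the x values do not have to occur in the data, which covers the case of the mean.

```r
library(BAS)      # provides the bodyfat data
library(ggplot2)

data(bodyfat, package = "BAS")
fat.mod <- lm(Bodyfat ~ Abdomen, data = bodyfat)

# Residual standard deviation, used as the sd of every conditional normal
s <- sigma(fat.mod)

# x values to inspect; they need not be observed values (the mean is one example)
x0  <- c(80, mean(bodyfat$Abdomen), 110)
mu0 <- predict(fat.mod, newdata = data.frame(Abdomen = x0))

# Characteristics of each overlaid distribution
data.frame(x0 = x0, mean = mu0, sd = s)

# Hypothetical helper: one rotated normal curve for a single x value
rotated_curve <- function(x, mu, sd, width = 8, n = 100) {
  y    <- seq(qnorm(0.025, mu, sd), qnorm(0.975, mu, sd), length.out = n)
  dens <- dnorm(y, mean = mu, sd = sd)
  data.frame(x0 = x, y_mean = mu, y = y,
             x = x - width * dens / max(dens))  # peak sits `width` units left of x0
}
fat_curves <- do.call(rbind, Map(rotated_curve, x0, mu0, MoreArgs = list(sd = s)))

ggplot(bodyfat, aes(x = Abdomen, y = Bodyfat)) +
  geom_point(alpha = 1/3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  geom_path(data = fat_curves, aes(x = x, y = y, group = x0),
            color = "blue", inherit.aes = FALSE)
```

The `width` argument only controls how far the curves stick out along the x axis (here in the units of Abdomen); it has no statistical meaning and can be adjusted to taste.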